An Architecture for Content Analysis of Documents
and its Use in Information and Knowledge Management Tasks

by Branimir K. Boguraev, Christopher Kennedy, and Sascha Brawer

ACM SIGCHI Bulletin, Volume 30, Number 2, April 1998
ISSN 0736-6906 · http://www.acm.org/sigchi/bulletin/1998.2/boguraev1.html

Abstract

We present a generalised architecture for document content management, with particular emphasis on component functionalities and on reconfigurability for different content management tasks. Natural language technologies are encapsulated in separate modules, which can then be customised to the specific requirements of different document analysis systems: the type of document, the depth of analysis, and the detail of the output representation. The versatility of the architecture is illustrated by configuring it for two diverse tasks: analysing technical manuals to instantiate databases for on-line assistance, and deriving topically rich abstractions of the content of arbitrary news stories.

Introduction

The natural language research program in Apple's Advanced Technologies Group (ATG) has been actively pursuing the automation of certain aspects of information analysis and knowledge management tasks. The program's focus is on establishing a core set of natural language processing (NLP) technologies and defining application areas for these. Representative projects have investigated a range of issues, including: optimal packaging of a substrate of NLP functionalities, with appropriate APIs, embedded within the Macintosh Operating System (Mac OS); an architecture for text processing, configurable for different content analysis applications; studies of how NLP technologies can be leveraged to further enhance the user experience; and the building of several information management systems incorporating linguistic processing of text-based documents.

In particular, language-related work at ATG facilitates a number of information management tasks, including: semantic highlighting and indexing, topic identification and tracking, content analysis and abstraction, document characterisation, and partial document understanding. Given the broad base of Apple users, the emphasis has been on finding suitable tasks which can be enhanced by linguistic functionalities, on striking the right balance of scalable and robust technologies which can reliably analyse realistic text sources, and on developing algorithms for focused semantic analysis starting from a relatively shallow syntactic base.

This article highlights the core capabilities of an architecture for content analysis, within which a number of information processing applications have been implemented. The use of the architecture for application building is illustrated by two examples: domain acquisition and document abstraction.