Document Databases for Information Management
Gregor Erbach
FTW, Wien
DFKI, Saarbrücken
ETL, Tsukuba
IM = DM?
- Is Information Management the same as Document Management?
  - No, because the relevant information may be distributed across several documents, or may be only a small part of a document.
- Then what is information management?
  - Extraction, storage, indexing and retrieval of information units contained in documents.
IM Applications
- Document Retrieval
- Routing
- Question Answering
- Factual Database Construction
- Summarisation
Document Annotation
- Document annotation adds information to documents
- Annotation formats: SGML, XML, LaTeX, ...
- Annotation standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, Dublin Core
Formal Properties of XML
- Tree structures
- Nodes with attribute/value pairs
- Node content is a string which can contain XML trees
- Nodes can have identifiers
- No type hierarchy
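To make these properties concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample sentence, the tag names (s, np) and the attribute names are made up for illustration; they are not taken from any of the annotation standards mentioned in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence: mixed content (text interleaved with subtrees).
doc = ET.fromstring(
    '<s id="s1" lang="en">'
    '<np id="np1" type="process">Annotation</np> adds '
    '<np id="np2" type="abstract">information</np>.'
    '</s>'
)

def walk(node, depth=0):
    """Return one line per node: tag, attribute/value pairs, leading text."""
    lines = [f"{'  ' * depth}{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:
        lines.extend(walk(child, depth + 1))
    return lines

# Nodes can have identifiers: build a lookup table from id to node.
by_id = {n.get("id"): n for n in doc.iter() if n.get("id") is not None}

# The string content of a node is recovered by concatenating all text pieces.
plain_text = "".join(doc.itertext())
```

Note that nothing in the tree says that an np with type="process" is a kind of anything else; that is the "no type hierarchy" point.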
Language Technologies
Think of language technologies (LT) as processes that add annotations to documents, based on an analysis of the documents' linguistic content. This point of view allows a uniform treatment of human-generated and LT-generated annotations.
Document-Level LT
- Language Identification
- Categorisation
- Summarisation
All of these can also be applied to parts of documents.
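As an illustration of a document-level LT applied to parts of a document, the following sketch annotates each paragraph with a guessed language. The stopword lists, the attribute names (lang, annotator) and the scoring rule are assumptions made for this example; a real language identifier would use a trained model.

```python
import xml.etree.ElementTree as ET

# Toy stopword lists (an assumption for this sketch, not a real resource).
STOPWORDS = {
    "en": {"the", "of", "and", "is", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}

def identify_language(elem):
    """Annotate elem in place with the language whose stopwords occur most often."""
    words = "".join(elem.itertext()).lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    elem.set("lang", max(scores, key=scores.get))
    elem.set("annotator", "LT")  # records where the annotation came from
    return elem

doc = ET.fromstring(
    "<doc>"
    "<p>Das ist der Titel und die Frage</p>"
    "<p>The index of the words is here</p>"
    "</doc>"
)
langs = [identify_language(p).get("lang") for p in doc]
```

A human annotator could set the same lang attribute with annotator="human", so downstream processing does not need to distinguish the two sources.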
Collection-Level LT
- Clustering
- Topic detection and tracking
- Multi-document summarisation
Fine-Grained LT
- Morphology
- Part-of-speech tagging
- (Shallow) parsing
- Coreference resolution
- Information extraction
LT and Document Annotation
(Annotated) Text Document -> LT -> Annotated Text Document
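Because each LT takes a (possibly already annotated) document and returns an annotated document, LT components can be chained. The sketch below shows one way to express this; the two steps (count_words, count_sentences) and the pipeline helper are hypothetical and deliberately naive.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def count_words(doc):
    """Hypothetical LT step: record the number of whitespace-separated words."""
    doc.set("words", str(len("".join(doc.itertext()).split())))
    return doc

def count_sentences(doc):
    """Hypothetical LT step: record a naive count of sentence-final punctuation."""
    text = "".join(doc.itertext())
    doc.set("sentences", str(sum(text.count(c) for c in ".!?")))
    return doc

def pipeline(*steps):
    """Compose LT steps; each one maps an annotated document to an annotated document."""
    return lambda doc: reduce(lambda d, step: step(d), steps, doc)

# The input already carries a human-made annotation, which the LT steps preserve.
doc = ET.fromstring('<doc source="human">One sentence. Another one here.</doc>')
annotated = pipeline(count_words, count_sentences)(doc)
```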
Information Retrieval
- Retrieval of information units in response to an information need
- How is the information need stated (keywords, questions, examples)?
- How is the information need represented?
- How are information units represented?
- How are the representations matched?
How are documents represented?
- XML trees
- Index of word/phrase occurrences
- Index of relations (represented as feature structures)
- Word, phrase and relation indexes should have pointers to text locations
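A word index with pointers to text locations might look like the following sketch. The location format (document id, start offset, end offset), the tokenisation by regular expression and the function names are assumptions for illustration; the relation index is not shown.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to a list of (doc_id, start, end) text locations."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start(), m.end()))
    return index

def find_phrase(index, docs, phrase):
    """Locate a multi-word phrase by checking the text at each pointer of its first word."""
    words = phrase.lower().split()
    pattern = re.compile(r"\W+".join(map(re.escape, words)) + r"\b", re.IGNORECASE)
    hits = []
    for doc_id, start, _ in index.get(words[0], []):
        m = pattern.match(docs[doc_id], start)
        if m:
            hits.append((doc_id, m.start(), m.end()))
    return hits

docs = {
    "d1": "Information management is not document management.",
    "d2": "Document retrieval",
}
index = build_index(docs)
phrase_hits = find_phrase(index, docs, "document management")
```

Because every index entry carries offsets, a hit can be turned back into the exact text span with docs[doc_id][start:end].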
How are queries represented?
- Words / phrases
- Relations (expressed as feature structures)
How are representations matched?
- Unification
- Apparent mismatches between query and representation can be resolved by relaxation of the query.
- Any required inference is performed by forward or backward chaining.
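The following sketch illustrates matching by unification with query relaxation. It is a simplification under stated assumptions: feature structures are modelled as nested Python dicts with atomic leaves, there is no structure sharing (reentrancy) and no type hierarchy, and relaxation simply drops named features from the query one at a time. The example relation and the function names are made up; inference by chaining is not shown.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic, non-None leaves).

    Returns the merged structure, or None if the two structures conflict.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = value
        return result
    return a if a == b else None

def match_with_relaxation(query, fact, relaxable):
    """Try to unify; on failure, drop relaxable top-level features one at a time."""
    result = unify(query, fact)
    dropped = []
    for feature in relaxable:
        if result is not None:
            break
        query = {k: v for k, v in query.items() if k != feature}
        dropped.append(feature)
        result = unify(query, fact)
    return result, dropped

# Hypothetical relation extracted from a document.
fact = {"pred": "acquire", "agent": {"type": "company", "name": "A"}, "year": "1999"}

exact = unify({"pred": "acquire", "agent": {"type": "company"}}, fact)
relaxed, dropped = match_with_relaxation(
    {"pred": "acquire", "year": "1998"}, fact, relaxable=["year"]
)
```

The list of dropped features could serve as one input to relevance ranking, since a match that needed less relaxation is presumably a better answer.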
Research Issues
- Relevance ranking for feature-structure-based queries
- Efficient indexing and matching of feature structures (fast unification)
- Representing information content (ontologies) in the formalism