Document Databases for Information Management. Gregor Erbach (FTW, Wien; DFKI, Saarbrücken; ETL, Tsukuba)


1 Document Databases for Information Management. Gregor Erbach (FTW, Wien; DFKI, Saarbrücken; ETL, Tsukuba), erbach@ftw.at

2 IM = DM? Is Information Management the same as Document Management? – No, because the relevant information may be distributed across several documents, or may be only a small part of a single document. Then what is information management? – Extraction, storage, indexing and retrieval of information units contained in documents.

3 IM Applications: Document Retrieval; Routing; Question Answering; Factual Database Construction; Summarisation

4 Document Annotation: Document annotation adds information to documents. Annotation formats: SGML, XML, LaTeX, ... Annotation standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, Dublin Core

5 Formal Properties of XML: tree structures; nodes with attribute/value pairs; node content is a string which can contain XML trees; nodes can have identifiers; no type hierarchy
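
The properties listed on this slide can be seen on a small example. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the fragment and its element and attribute names (`doc`, `s`, `org`, `type`) are invented for illustration and do not come from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment; tag and attribute names are assumptions.
xml_text = (
    '<doc id="d1" lang="en">'
    '<s id="s1">Erbach works at <org id="o1" type="research">FTW</org> in Wien.</s>'
    '</doc>'
)

root = ET.fromstring(xml_text)


def describe(node, depth=0):
    """Return one line per node: tag, attribute/value pairs, and leading text."""
    lines = [f"{'  ' * depth}{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# Tree structure: each nested element is printed one level deeper.
print("\n".join(describe(root)))

# Nodes can have identifiers: here "id" is an ordinary attribute used as a key.
by_id = {n.get("id"): n for n in root.iter() if n.get("id") is not None}

# Node content is a string that can contain further trees: the text of <s>
# is split around the embedded <org> element (text before, tail after).
print(repr(by_id["s1"].text), repr(by_id["o1"].tail))
```

Note that nothing in the tree says that `org` is a kind of anything else, which is the "no type hierarchy" point: any such relation would have to be stated outside the XML itself.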

6 Language Technologies: Think of language technologies as processes that add annotations to documents, based on an analysis of the documents' linguistic content. This point of view allows a uniform treatment of human-generated and LT-generated annotations.

7 Document-Level LT: Language Identification; Categorisation; Summarisation. All of these can also be applied to parts of documents.

8 Collection-Level LT: Clustering; Topic detection and tracking; Multi-document summarisation

9 Fine-Grained LT: Morphology; Part-of-speech tagging; (Shallow) parsing; Coreference resolution; Information extraction

10 LT and Document Annotation: (Annotated) Text Document → LT → Annotated Text Document
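
The pipeline on this slide can be sketched as a chain of functions that each take an (already annotated) document and return it with more annotation. The two steps below are toy stand-ins, not real language technology, and all names (`tokenise`, `tag_capitalised`, the `w` element, the `cap` and `author` attributes) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def tokenise(doc):
    """Toy LT step: wrap each whitespace-separated token of <text> in a <w> element."""
    text_node = doc.find("text")
    words = text_node.text.split()
    text_node.text = None
    for i, word in enumerate(words):
        w = ET.SubElement(text_node, "w", {"id": f"w{i}"})
        w.text = word
        w.tail = " "
    return doc


def tag_capitalised(doc):
    """Toy LT step: mark capitalised tokens with cap="yes" (stand-in for a tagger)."""
    for w in doc.iter("w"):
        if w.text[:1].isupper():
            w.set("cap", "yes")
    return doc


# The input already carries a human-generated annotation (the author attribute).
doc = ET.fromstring('<doc author="human"><text>Gregor visited Tsukuba</text></doc>')

# Each LT step maps an annotated document to a more richly annotated one.
for step in (tokenise, tag_capitalised):
    doc = step(doc)

print(ET.tostring(doc, encoding="unicode"))
```

Because both steps read and write the same tree, the human-supplied attribute and the machine-supplied ones end up side by side, which is the uniform treatment the previous slide argues for.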

11 Information Retrieval: Retrieval of information units in response to an information need. How is the information need stated (keywords, questions, examples)? How is the information need represented? How are information units represented? How are the representations matched?

12 How are documents represented? XML trees; index of word/phrase occurrences; index of relations (represented as feature structures). The word, phrase and relation indexes should have pointers to text locations.
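
An index with pointers to text locations might look like the following minimal sketch for the word case. It assumes plain-text documents keyed by a made-up document id and records character offsets; a relation index could store feature structures with the same kind of pointer, but that is not shown here.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to a list of (doc_id, start, end) character offsets."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start(), m.end()))
    return index


# Hypothetical two-document collection.
docs = {
    "d1": "Document databases store annotated documents.",
    "d2": "A document may contain many information units.",
}
index = build_index(docs)

# Each hit points back into the text, so the matching passage can be shown.
for doc_id, start, end in index["document"]:
    print(doc_id, start, end, docs[doc_id][start:end])
```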

13 How are queries represented? Words / phrases; relations (expressed as feature structures)

14 How are representations matched? By unification. Apparent mismatches between query and representation can be resolved by relaxing the query. Any inference needed is carried out by forward or backward chaining.
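
Matching by unification, with relaxation of the query on failure, can be sketched as follows. This is a simplified illustration under stated assumptions: feature structures are nested Python dicts with atomic string values, there are no variables or shared (reentrant) substructures, and relaxation simply drops whole features in a caller-supplied order. The slides do not specify a relaxation strategy, so `match_with_relaxation` and its `droppable` parameter are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if the two conflict.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values under the same feature
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atomic values (or a dict against an atom) unify only if they are equal.
    return a if a == b else None


def match_with_relaxation(query, fact, droppable):
    """Try the full query, then drop features from `droppable` one at a time."""
    current = dict(query)
    dropped = []
    while True:
        result = unify(current, fact)
        if result is not None:
            return result, dropped
        remaining = [f for f in droppable if f in current]
        if not remaining:
            return None, dropped
        current.pop(remaining[0])
        dropped.append(remaining[0])


# Hypothetical relation extracted from a document.
fact = {
    "rel": "employ",
    "employer": {"name": "FTW", "city": "Wien"},
    "employee": {"name": "Erbach"},
}
# The query says "Vienna" where the document says "Wien": an apparent mismatch.
query = {
    "rel": "employ",
    "employer": {"city": "Vienna"},
    "employee": {"name": "Erbach"},
}

result, dropped = match_with_relaxation(query, fact, droppable=["employer"])
print(result, dropped)
```

Dropping a whole feature is the bluntest form of relaxation; resolving "Vienna" against "Wien" through an ontology or an inference rule would be the alternative that the chaining remark on this slide points to.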

15 Research Issues: relevance ranking for feature-structure-based queries; efficient indexing and matching of feature structures (fast unification); information content (ontologies) to be represented in the formalism

