Published by Darrion Scruton. Modified over 10 years ago.
1
Metadata in Carrot II
Current metadata
– TF.IDF for both documents and collections
– Full-text index
– Metadata are transferred between different nodes
Potential problems
– Storage cost: metadata size is huge
– Computation cost: computation time is long
– Communication cost: metadata transfer time is long
– Semantic meaning of text: the metadata carry little semantic information
Goal
– Need a more efficient mechanism to represent documents/collections
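The TF.IDF metadata named on this slide can be illustrated with a minimal Python sketch. This is not Carrot II's implementation: the whitespace tokenization, the length-normalized term frequency, the plain logarithmic IDF, and the function name `tf_idf` are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = (term count / document length) * log(N / df).
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["carrot metadata index", "carrot query routing", "text summary metadata"]
weights = tf_idf(docs)
```

Each document ends up with one weight per distinct term, so the metadata grow with vocabulary size, which is the storage and transfer cost the slide points to.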
2
Proposed Approach
Sources for metadata generation
– Text summaries vs. full text
– Multi-document summarization on collections
Metadata organization
– Topic hierarchy
Automatic metadata generation
– Statistical language model
3
Document (Text) Summarization
Document summarization (DS)
– “The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).” (Mani & Maybury, 1999)
– Full text can be reduced to an abstract without losing too much useful information
Multi-Document Summarization (MDS)
– Works on related documents (same topic)
– Can capture relations across documents
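As a rough illustration of reducing full text to a shorter abstract, the sketch below picks the sentences whose words are most frequent in the text. This is a simple frequency-based extractive heuristic chosen for illustration, not the summarization method of the presentation or of Mani & Maybury; the function name `summarize`, the sentence splitter, and the scoring rule are assumptions, and no stop-word removal is done.

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, in original order (hypothetical helper)."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average corpus frequency of the sentence's words.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

text = ("Metadata describes documents. "
        "Metadata describes collections of documents. "
        "Weather is nice.")
summary = summarize(text, max_sentences=1)
```

Passing the concatenated text of several related documents to the same function gives a crude stand-in for multi-document summarization, though it does not model relations across documents.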
4
Language Model
Language model
– An approximation to real language
– Tries to explain already observed phenomena or future behavior
– A probability distribution over strings of a finite alphabet
Basic idea when used in IR
– Infer a language model for each document
– Estimate the probability of generating the query according to each model, rather than estimating the probability of relevance of each document to the query
– Rank the documents according to these probabilities
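The three steps above can be sketched in Python with a unigram model per document. The smoothing is an assumption: this sketch uses linear interpolation with the collection model (Jelinek-Mercer smoothing, weight `lam`), which is a common choice but not necessarily the estimator in Ponte's thesis. The function name `rank_by_query_likelihood` and the whitespace tokenization are also hypothetical.

```python
import math
from collections import Counter

def rank_by_query_likelihood(query, docs, lam=0.5):
    """Return document indices, best first, by log P(query | document model)."""
    doc_tokens = [d.lower().split() for d in docs]
    collection = Counter(t for tokens in doc_tokens for t in tokens)
    collection_len = sum(collection.values())
    scores = []
    for i, tokens in enumerate(doc_tokens):
        counts = Counter(tokens)
        log_p = 0.0
        for term in query.lower().split():
            # Maximum-likelihood estimates from the document and the collection.
            p_doc = counts[term] / len(tokens) if tokens else 0.0
            p_coll = collection[term] / collection_len if collection_len else 0.0
            p = lam * p_doc + (1 - lam) * p_coll  # assumed smoothing
            if p == 0.0:
                # Term unseen anywhere in the collection.
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores.append((log_p, i))
    return [i for _, i in sorted(scores, key=lambda s: s[0], reverse=True)]

docs = [
    "language model for retrieval",
    "text summarization of documents",
    "statistical language model",
]
ranking = rank_by_query_likelihood("language model", docs)
```

Smoothing with the collection model keeps a document from receiving zero probability just because it lacks one query term.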
5
References
Inderjeet Mani, Mark T. Maybury. Advances in Automatic Text Summarization. MIT Press, 1999.
J. Ponte. A Language Modeling Approach to Information Retrieval. PhD thesis, Dept. of Computer Science, University of Massachusetts, Amherst, 1998.
Michael P. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.