Summarization of XML Documents
K Sarath Kumar


Summarization of XML Documents K Sarath Kumar

Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. Example Summaries
V. Conclusion and Future Work

Motivation
Given a collection of XML documents (e.g., IMDB), there are two types of XML document summaries:
1) Generic summary: summarizes the entire contents of the document.
2) Query-biased summary: summarizes those parts of the document that are relevant to the user's query.

Aims
We aim at summaries that are:
- generated automatically
- highly constrained in size
- highly informative
- high in coverage

Challenges
- Structure is as important as text
- Text units vary in length

System for XML Summarization
[System diagram] An Info Unit Generator splits the XML document into tag units and text units. The Ranking Unit (a Tag Ranker and a Text Ranker, backed by corpus statistics) turns these into ranked tag units and ranked text units. The Summary Generator then assembles the summary from the ranked units, subject to the given summary size.

Information Units of an XML Document
Tag units:
- regarded as metadata
- can be highly redundant
- can be encoded into a schema/DTD

Text units:
- instances of the tags
- much less redundant
- have different sizes
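As a toy illustration, the split into tag units and text units can be sketched with Python's standard XML parser (the `info_units` helper and the sample document are invented for illustration; the transcript does not show the system's actual unit generator):

```python
import xml.etree.ElementTree as ET

def info_units(xml_text):
    """Split an XML document into tag units (element names) and
    text units (the text instances attached to each tag)."""
    root = ET.fromstring(xml_text)
    tag_units, text_units = [], []
    for elem in root.iter():  # document order
        tag_units.append(elem.tag)
        if elem.text and elem.text.strip():
            text_units.append((elem.tag, elem.text.strip()))
    return tag_units, text_units

doc = """<movie>
  <title>Titanic</title>
  <actor>Leonardo DiCaprio</actor>
  <actor>Kate Winslet</actor>
</movie>"""
tags, texts = info_units(doc)
# tags repeats "actor" twice: tag units can be highly redundant,
# while each text unit is a distinct instance of its tag.
```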

Ranking Unit
I. Tag Ranking
Typicality: how salient is the tag in the corpus?
- Typical tags define the context of the document.
- They occur regularly in most or all of the documents.
- Quantified by the fraction of documents in which the tag occurs (document frequency, df).

Specialty: does the tag occur more or less frequently in this document than usual?
- Special tags denote a special aspect of the current document.
- They occur unusually many or unusually few times in the current document.
- Quantified by the deviation from the average number of occurrences per document.
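A minimal sketch of these two quantities computed from corpus statistics (the function names, the toy corpus, and the exact formulas are assumptions; the slide's formulas are not reproduced in the transcript):

```python
def typicality(tag, docs):
    """Fraction of documents containing the tag (document frequency)."""
    return sum(1 for d in docs if tag in d) / len(docs)

def specialty(tag, doc, docs):
    """Deviation of this document's tag count from the corpus average."""
    avg = sum(d.get(tag, 0) for d in docs) / len(docs)
    return abs(doc.get(tag, 0) - avg)

# Each document is represented as a tag -> occurrence-count map.
docs = [
    {"actor": 10, "keyword": 3, "trivia": 1},
    {"actor": 8, "keyword": 2},
    {"actor": 12, "trivia": 7},
]
typ = typicality("actor", docs)           # occurs in every document -> 1.0
spe = specialty("trivia", docs[2], docs)  # 7 occurrences vs. a corpus average of 8/3
```

A high-typicality tag (like actor) anchors the document's context, while a high-specialty tag (like trivia here) flags what is unusual about this particular document.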

II. Text Ranking
Two categories of text:
1) Entities
2) Regular text

Ranking is done based on the context of occurrence: the tag context, the document context, and the corpus context.
- Text with no redundancy in tag context (e.g., actor names, genre)
- Text with redundancy in tag context (e.g., plots, goofs, trivia items)

Correlated Tags and Text
We often find related tag units that are siblings of each other, e.g., Actor and Role. An inclusion principle governs when such correlated units are included together; the slide distinguished two cases with accompanying formulas that are not preserved in the transcript.

Generation of Summary
Consider the following tag rank table:

Tag       Prob.
Actor     0.5
Keyword   0.3
Trivia    0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags, and 6 trivia tags would be required:

Tag       Required no. of tags   Available no. of tags
Actor     15                     30
Keyword   9                      2
Trivia    6                      15

Since fewer keyword tags are available than required, distribute the remaining tag budget by re-normalizing the distribution over the tags that still have instances available.
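The budget distribution above can be sketched as a greedy loop that allocates proportionally to the tag probabilities and re-normalizes over the tags that still have instances left (a toy implementation inferred from the worked example; the rounding details are simplified):

```python
def allocate(budget, probs, available):
    """Distribute a tag budget proportionally to tag probabilities,
    re-normalizing over tags that still have instances left."""
    alloc = {t: 0 for t in probs}
    remaining = dict(available)
    while budget > 0:
        live = {t: p for t, p in probs.items() if remaining[t] > 0}
        if not live:
            break
        total = sum(live.values())
        added = 0
        for t, p in live.items():
            take = min(round(budget * p / total), remaining[t])
            alloc[t] += take
            remaining[t] -= take
            added += take
        if added == 0:
            break  # budget too small to assign any whole tag
        budget -= added
    return alloc

probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
available = {"actor": 30, "keyword": 2, "trivia": 15}
result = allocate(30, probs, available)
```

On the slide's numbers this reproduces the two rounds: 23 tags in round 1, then the remaining 7 split over actor and trivia, ending at 20 actor, 2 keyword, and 8 trivia tags.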

Generating the summary with 30 tags:

Step   Tag       Prob.   Available   To be added   Added this round (cumulative)
Round 1
1.1    actor     0.5     30          15            15
1.2    keyword   0.3     2           9             2
1.3    trivia    0.2     15          6             6
       Round 1 total:                              23
Round 2 (7 tags remaining; re-normalize over actor and trivia)
2.1    actor     0.5/0.7 15          5             5   (20)
-      keyword   0       0           0             0   (2)
2.2    trivia    0.2/0.7 9           2             2   (8)
       Total:                                      30

A Few Example Summaries
Titanic.xml - Summaries

Conclusion
- A fully automated XML summary generator
- Ranking of tags and text based on the ranking model
- Generation of the summary from ranked tags and text within a memory budget
- A user evaluation is underway

Future Work
- Rewriting the structure of the XML documents during summarization
- Possible use of text summarizers for long text
- Query-biased XML summary generation

Thanks!

Appendix

Informativeness

Coverage

Ranking Model
I. Tag Ranker
Typicality: how typical is the tag in the corpus?
The tag score is a mixture model of typicality and specialty.

Specialty : How unusually frequent/infrequent is the tag in the current document compared to an average document of the corpus?

Text with redundancy in tag context
Sort terms by frequency and take the top m terms as a centroid query. Each text unit is scored by its relevance to this centroid query and its similarity to already selected units; the final score is computed using Maximal Marginal Relevance (MMR).
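MMR greedily picks the unit that maximizes a trade-off between relevance and redundancy. A minimal sketch (the word-overlap relevance and Jaccard similarity below are stand-ins; the slide's actual formulas are not in the transcript):

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: repeatedly pick the unit that
    best trades off relevance against redundancy with picks so far."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: centroid query terms and three candidate text units.
centroid = {"ship", "iceberg", "sinking"}
sents = [
    "the ship hit an iceberg",
    "the ship hit a big iceberg",
    "passengers boarded lifeboats while sinking",
]
def rel(s):
    return len(set(s.split()) & centroid) / len(centroid)
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)
picked = mmr_select(sents, rel, sim, k=2, lam=0.5)
# MMR skips the near-duplicate second sentence in favor of a diverse one.
```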

Text without redundancy in tag context
Two cases are scored separately: redundancy at the tag level, and no redundancy at the tag level. The weighting parameter is set empirically; the slide's formulas are not preserved in the transcript.

A relative count matrix is constructed.
Given two tags Ti and Tj, the relative importance of Ti with respect to a higher-ranked Tj is calculated by dividing both probabilities by P(Ti|D) (this shows how many Ti tags are worth one Tj tag). Ti is considered only after P(Tj|D)/P(Ti|D) Tj tags have been considered. Extending this idea, a matrix of relative counts can be formed:

Tag       Prob.
actor     0.5
keyword   0.3
trivia    0.2

          actor   keyword   trivia   Row-wise total
actor     1       -         -        1
keyword   2       1         -        3
trivia    3       2         1        6
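A sketch of the matrix construction (rounding the probability ratio up is an inference; it matches the slide's example, e.g. two keyword tags are worth one actor tag since 0.5/0.3 rounds up to 2):

```python
import math

def relative_counts(probs):
    """Relative count matrix: entry (ti, tj), for tj ranked at least as
    high as ti, says how many ti tags are worth one tj tag."""
    tags = sorted(probs, key=probs.get, reverse=True)
    matrix = {}
    for i, ti in enumerate(tags):
        for tj in tags[: i + 1]:
            matrix[(ti, tj)] = math.ceil(probs[tj] / probs[ti])
    return matrix

probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
m = relative_counts(probs)
# Row-wise totals: actor 1, keyword 2+1=3, trivia 3+2+1=6, as on the slide.
```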

Ocean’s Eleven.xml - Summaries

Generating the summary with 30 tags for Ocean's Eleven.xml: Round 1 adds 17 tags before the available tags run short; Round 2 re-normalizes the remaining budget, ending at 24 actor, 2 keyword, and 4 trivia tags (30 in total). The per-step counts appeared in a table on the slide.