Summarization of XML Documents Kondreddi Sarath Kumar

Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. User Evaluation
V. The Xoom tool and a few example summaries
VI. Conclusion

Motivation
An XML document collection (e.g., IMDB) consists of XML documents. There are two types of XML document summaries:
1) Generic summary – summarizes the entire contents of the document.
2) Query-biased summary – summarizes those parts of the document that are relevant to the user's query.

Aims
We aim at summaries that are:
- Generated automatically
- Highly constrained by size
- Highly informative
- Of high coverage

Challenges
- Structure is as important as text
- Varying text length

System for XML Summarization
[Architecture diagram: the Info Unit Generator splits the XML document into tag units and text units; the Ranking Unit (a Tag Ranker and a Text Ranker, both drawing on corpus statistics) produces ranked tag units and ranked text units; given a summary size, the Summary Generator assembles these into the final summary.]

Information Units of an XML Document
Tag
- Regarded as metadata
- Can be highly redundant
Text
- An instance of the tag
- Much less redundant
- Varies in size
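As a rough illustration of how a document might be split into these two kinds of units, here is a minimal sketch using only Python's standard library; the (tag, text) representation is an assumption for illustration, not the system's actual data model.

```python
# Split an XML document into tag units (metadata layer, with occurrence
# counts) and text units (the textual instances of each tag).
import xml.etree.ElementTree as ET
from collections import Counter

def extract_info_units(xml_string):
    root = ET.fromstring(xml_string)
    tag_units = Counter()   # tag name -> number of occurrences
    text_units = []         # (tag name, text instance) pairs
    for elem in root.iter():
        tag_units[elem.tag] += 1
        if elem.text and elem.text.strip():
            text_units.append((elem.tag, elem.text.strip()))
    return tag_units, text_units

doc = "<movie><title>Titanic</title><actor>Kate Winslet</actor><actor>Leonardo DiCaprio</actor></movie>"
tags, texts = extract_info_units(doc)
print(tags)   # Counter({'actor': 2, 'movie': 1, 'title': 1})
print(texts)  # [('title', 'Titanic'), ('actor', 'Kate Winslet'), ('actor', 'Leonardo DiCaprio')]
```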

Ranking Unit
I. Tag Ranking
Typicality: How salient is the tag in the corpus?
- Typical tags define the context of the document
- They occur regularly in most or all of the documents
- Quantified by the fraction of documents in which the tag occurs (document frequency, df)
Specialty: Does the tag occur more or less frequently in this document than usual?
- Special tags denote a special aspect of the current document
- They occur many more or far fewer times in the current document than usual
- Quantified by the deviation from the average number of occurrences per document
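The slide names the two signals but not how the corpus statistics are stored or how the scores are combined; the following is a minimal sketch under the assumption that the statistics are plain per-document tag counts.

```python
# Two tag-ranking signals: typicality (document frequency in the corpus)
# and specialty (deviation from the average per-document count).

def typicality(tag, corpus_tag_docs, num_docs):
    """Fraction of corpus documents that contain the tag."""
    return corpus_tag_docs.get(tag, 0) / num_docs

def specialty(tag, doc_tag_counts, corpus_avg_counts):
    """Absolute deviation of this document's tag count from the corpus-wide
    average; a large deviation in either direction marks a special tag."""
    avg = corpus_avg_counts.get(tag, 0.0)
    return abs(doc_tag_counts.get(tag, 0) - avg)

# Toy corpus statistics: <actor> appears in 95 of 100 documents, on average
# 12 times per document; the current document has 40 actor tags.
print(typicality("actor", {"actor": 95}, 100))             # 0.95 -> very typical
print(specialty("actor", {"actor": 40}, {"actor": 12.0}))  # 28.0 -> special here
```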

II. Text Ranking
Two categories of text:
1) Entities
2) Regular text

Ranking is done based on the context of occurrence: tag context, document context, and corpus context.
- No redundancy in the tag context (e.g., actor names, genre)
- Redundancy in the tag context (e.g., plots, goofs, trivia items)
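The deck names the three contexts but not a scoring formula, so the following is only an illustrative sketch of redundancy-aware selection within one tag context (the redundant case of plots, goofs, and trivia items); the overlap threshold and the length-based salience proxy are assumptions.

```python
# Greedily pick text units within one tag context, skipping units whose
# word overlap with already-picked units marks them as near-duplicates.

def rank_redundant_text(units, k):
    chosen, seen_words = [], set()
    for unit in sorted(units, key=len, reverse=True):  # crude salience proxy
        words = set(unit.lower().split())
        overlap = len(words & seen_words) / max(len(words), 1)
        if overlap < 0.5:                              # skip near-duplicates
            chosen.append(unit)
            seen_words |= words
        if len(chosen) == k:
            break
    return chosen

trivia = ["The ship set consumed most of the budget",
          "Most of the budget went to the ship set",
          "James Cameron drew the sketch himself"]
print(rank_redundant_text(trivia, 2))
# ['The ship set consumed most of the budget', 'James Cameron drew the sketch himself']
```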

Correlated tags and text
We often find related tag units that are siblings of each other, e.g., Actor and Role.
Inclusion Principle: Case 1 and Case 2 (the formal conditions appeared as formulas on the original slide).
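Since the formal conditions did not survive the transcript, the following captures only the apparent spirit of the principle, as a hedged sketch: when one sibling of a correlated pair is selected for the summary, its counterpart is pulled in too, so that, e.g., an actor never appears without the matching role. The unit-id encoding is hypothetical.

```python
# Close a selection under sibling correlation: every selected unit drags
# in its correlated sibling (e.g. <actor> and its <role>).

def apply_inclusion(selected, siblings):
    """selected: set of unit ids; siblings: unit id -> correlated sibling id."""
    closed = set(selected)
    for unit in selected:
        partner = siblings.get(unit)
        if partner is not None:
            closed.add(partner)
    return closed

siblings = {"actor:Kate Winslet": "role:Rose", "role:Rose": "actor:Kate Winslet"}
print(sorted(apply_inclusion({"actor:Kate Winslet"}, siblings)))
# ['actor:Kate Winslet', 'role:Rose']
```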

Generation of Summary
Consider the following tag rank table:

Tag      Prob.
Actor    0.5
Keyword  0.3
Trivia   0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags, and 6 trivia tags would be required.

Tag      Required no. of tags   Available no. of tags
Actor    15                     30
Keyword  9                      2
Trivia   6                      15

Only 2 keyword tags are available, so we distribute the remaining tag budget by re-normalizing the distribution over the tags that still have available instances.

Generating the summary with 30 tags:

Step   Tag       Available   To be added   Added in the round   (cumulative)
Round 1
1.1    actor     30          15            15
1.2    keyword   2           9             2
1.3    trivia    15          6             6
       Total                               23
Round 2 (remaining budget of 7, re-normalized over actor and trivia)
2.1    actor     30          5             5                    (20)
-      keyword   0           0             0                    (2)
2.2    trivia    15          2             2                    (8)
       Total                               30
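A sketch of the allocation procedure worked through above, assuming proportional rounding in each round; the authors' exact rewrite rules may differ.

```python
# Allocate a tag budget proportionally to the rank probabilities; when a tag
# runs out of available instances, re-normalize over the remaining tags and
# redistribute the unused budget in further rounds.

def allocate(budget, probs, available):
    allocated = {t: 0 for t in probs}
    while budget > 0:
        # Tags that still have unused instances left.
        open_tags = {t: p for t, p in probs.items()
                     if available[t] > allocated[t] and p > 0}
        if not open_tags:
            break
        total = sum(open_tags.values())
        added = 0
        for t, p in open_tags.items():
            want = round(budget * p / total)
            take = min(want, available[t] - allocated[t])
            allocated[t] += take
            added += take
        budget -= added
        if added == 0:
            break
    return allocated

print(allocate(30, {"actor": 0.5, "keyword": 0.3, "trivia": 0.2},
               {"actor": 30, "keyword": 2, "trivia": 15}))
# Round 1 yields {actor: 15, keyword: 2, trivia: 6}; round 2 redistributes
# the remaining 7 over actor and trivia -> {'actor': 20, 'keyword': 2, 'trivia': 8}
```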

User Evaluation

Dataset   No. of files   No. of unique tags   Documents used for evaluation
Movie     200,…          …                    8
People    150,…          …                    4

Several summary sizes and alpha values were used per dataset (Movie: alphas 1, 0.8, 0.6), for a total of 64 + 16 = 80 automatically generated summaries.

The 80 automatically generated summaries were mixed with 32 human-generated summaries.
Summaries were graded on a scale of 1–7, where 1 = extremely bad and 7 = perfect.
Six different evaluators took part; each summary was evaluated by at least three of them.

User Evaluation
Tabulation of average and above-average grades (4–7). Note: a grade is shown only if at least 2 evaluators agreed on it.
[Results grid partly lost in transcription; the recoverable totals:]
Movie – totals across sizes: 23/24 (95.8%), 19/24 (79.1%), 5/16 (31.2%); overall 47/64 (73.4%)
People – totals across sizes: 7/8 (87.5%), 5/8 (62.5%); overall 12/16 (75%)
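For concreteness, a small sketch of how such a tabulation could be computed from raw grades, assuming a summary counts as average-or-above when at least two evaluators agree on a grade in the 4–7 range; the data layout is hypothetical.

```python
# Count summaries where an agreed-upon grade (given by >= 2 evaluators)
# falls in the average-or-above band (4-7).
from collections import Counter

def good_fraction(grades_per_summary):
    good = 0
    for grades in grades_per_summary:      # grades: list of 1-7 scores
        agreed = [g for g, n in Counter(grades).items() if n >= 2]
        if any(g >= 4 for g in agreed):
            good += 1
    return good, len(grades_per_summary)

print(good_fraction([[5, 5, 3], [2, 2, 6], [4, 6, 7]]))  # (1, 3)
```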

Xoom
A tool for exploring and summarizing XML documents.
[Screenshot: Exploration Mode]

Xoom
[Screenshot: Summarization Mode on Titanic.xml]

Conclusion
- A fully automated XML summary generator
- Ranking of tags and text based on the ranking model
- Generation of the summary from ranked tags and text within the memory budget
- Xoom: a tool for exploring and summarizing XML documents
- User evaluation

Publications
Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of the Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.
A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.
User Evaluation of Summaries – Link:

Thanks!

Appendix Informativeness

Coverage

Why not tag-text pairs?

Ocean’s Eleven.xml - Summaries

Titanic.xml on OST Summarizer (raw output):
Gern Drowning man Martin, Johnny (I) Rescue boat crewman Lynch, Don (II) Frederick Spedden Cameron, James (I) Cameo appearance (steerage dancer) Cragnotti, Chris Victor Giglio Kenny, Tony (I) Deckhand Campolo, Bruno Second-class man Abercrombie, Ian adr loop group uncredited Allen, Melinda assistant: James Cameron uncredited Altman, John (I) historical music advisor Altman, John (I) music arranger: period music Amorelli, Mike rigging gaffer Amorelli, Paul rigging best boy electric Anaya, Daniel grip Andrade, Maria Louise costumer Baker, Brett photo double: Leonardo DiCaprio Arvizu, Ricardo grip Bailes, Tim marine consultant Arneson, Charlie aquatic researcher Arneson, Charlie aquatic supervisor Arnold, Amy key set costumer: women Atkinson, Lisa (I) pre-production consultant Barius, Claudette additional still photographer: pre-production uncredited Baker, Jeanie costumer Barton, Roger associate editor Baker, Tom (VI) electrician Bass, Andy (I) assistant music engineer Barber, Jamie (I) first assistant camera: Halifax Baylon, Hugo location assistant Bee, Guy Norman camera operator Benarroch, Ariel first assistant camera: second unit uncredited Bendt, Tony company grip Boccoli, Daniel apprentice editor Botham, Buddy generator operator Bonner, Kit naval consultant Blevins, Cha costumer as Deborah 'Cha' Blevins Bloom, Kirk second assistant camera Bolton, Paul electrician Bornstein, Bob music preparation Bozeman, Marsha costumer Broberg, David first assistant film editor Brady, Kenneth Patrick production assistant Bruno, Keri production assistant Bryan, Mitch (III) assistant video assist operator Bryce, Malcolm lamp operator Burdick, Geoff production associate Buckley, John (III) gaffer Cameron, James (I) director of photography: Titanic deep dive camera Cameron, James (I) special camera equipment designer Cameron, Michael (II) special deep ocean camera system Byall, Bruce grip Byron, Carol Sue additional production accountant uncredited Canedo, Luis rigging electrician as Jose

User Evaluation of Summaries – IMDB Dataset: files used

Dataset   Filenames
Movie     American Beauty; Ocean's Eleven; Kill Bill Part II; Saving Private Ryan; The Last Samurai; The Usual Suspects; Titanic; 2001: A Space Odyssey
People    Brad Pitt; Matt Damon; Ben Affleck; Leonardo DiCaprio

User Evaluation of Summaries – IMDB Dataset

Xoom