Summarization of XML Documents
Kondreddi Sarath Kumar
Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. User Evaluation
V. Xoom tool and a few example summaries
VI. Conclusion
Motivation

XML Document Collection (e.g., IMDB)

Types of XML Document Summaries
1) Generic summary: summarizes the entire contents of the document.
2) Query-biased summary: summarizes those parts of the document that are relevant to the user's query.
Aims
We aim at summaries that are:
- Generated automatically
- Highly constrained by size
- Highly informative
- High in coverage

Challenges
- Structure is as important as text
- Varying text length
7
System for XML Summarization Info Unit Generator SUMMARY GENERATOR RANKING UNIT Tag Ranker Text Ranker Corpus Statistics Tag Units Text Units Summary Size Ranked Tag units Ranked Text units Summary XML Doc
Information Units of an XML Document

Tag
- Regarded as metadata
- Can be highly redundant

Text
- An instance of the tag
- Much less redundant
- Varies in size
Ranking Unit

I. Tag Ranking

Typicality: How salient is the tag in the corpus?
- Typical tags define the context of the document
- They occur regularly in most or all of the documents
- Quantified by the fraction of documents in which the tag occurs (document frequency, df)

Specialty: Does the tag occur more or less frequently in this document?
- Special tags denote a special aspect of the current document
- They occur many more or far fewer times in the current document than usual
- Quantified by the deviation from the average number of occurrences per document
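The two tag scores above can be sketched as follows. This is a minimal illustration only: the corpus layout, the function names, and the toy tag counts are assumptions, not the system's actual implementation.

```python
from collections import Counter

# Hypothetical corpus: each document reduced to a multiset of tag names.
# (The real system parses XML; this sketch only needs per-document tag counts.)
corpus = [
    Counter(movie=1, actor=20, keyword=15, trivia=3),
    Counter(movie=1, actor=25, keyword=10),
    Counter(movie=1, actor=18, keyword=12, trivia=40),  # unusually many trivia
]

def typicality(tag, corpus):
    """Fraction of documents in which the tag occurs at all (df)."""
    return sum(1 for doc in corpus if tag in doc) / len(corpus)

def specialty(tag, doc, corpus):
    """Deviation of this document's tag count from the corpus average."""
    avg = sum(d[tag] for d in corpus) / len(corpus)
    return abs(doc[tag] - avg)

doc = corpus[2]
for tag in sorted(doc):
    print(tag, typicality(tag, corpus), specialty(tag, doc, corpus))
```

Here a tag like `movie` scores high on typicality (it appears in every document), while `trivia` in the third document scores high on specialty (it occurs far more often than the corpus average).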
II. Text Ranking

Two categories of text:
1) Entities
2) Regular text
Ranking is done based on the context of occurrence: the tag context, the document context, and the corpus context.
- No redundancy in the tag context (e.g., actor names, genre)
- Redundancy in the tag context (e.g., plots, goofs, trivia items)
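One way to act on redundancy in the tag context is to prefer text units that add new information over near-duplicates. The slide does not give the actual redundancy model, so the word-overlap novelty measure below is a stand-in assumption for illustration only.

```python
def novelty(candidate, selected):
    """Fraction of the candidate's words not already covered by selected units."""
    words = set(candidate.lower().split())
    covered = set()
    for s in selected:
        covered |= set(s.lower().split())
    if not words:
        return 0.0
    return len(words - covered) / len(words)

def rank_text_units(units):
    """Greedily order units within one tag context, preferring novel ones."""
    selected = []
    remaining = list(units)
    while remaining:
        best = max(remaining, key=lambda u: novelty(u, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical trivia items for one movie document.
trivia = [
    "The ship set used in the film was nearly full scale",
    "The ship set was nearly full scale",            # redundant with the first
    "James Cameron sketched the drawing himself",
]
print(rank_text_units(trivia))
```

The redundant second item sinks to the end of the ranking because every one of its words is already covered by the first, while the unrelated third item stays near the top.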
Correlated Tags and Text

We often find related tag units that are siblings of each other, e.g., Actor and Role.

Inclusion Principle
Case 1:
Case 2:
Generation of Summary

Consider the following tag rank table:

Tag      Prob.
Actor    0.5
Keyword  0.3
Trivia   0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags, and 6 trivia tags would be required.

Tag      Required no. of tags   Available no. of tags
Actor    15                     30
Keyword  9                      2
Trivia   6                      15

Distribute the remaining "tag budget" by re-normalizing the distribution over the available tags.
Generating the summary with 30 tags

Round 1 (budget: 30 tags)
Step  Tag      Prob.  Tags available  Tags to be added  Tags added (cumulative)
1.1   actor    0.5    30              15                15
1.2   keyword  0.3    2               9                 2
1.3   trivia   0.2    15              6                 6
Total added after Round 1: 23

Round 2 (remaining budget: 7 tags, probabilities re-normalized)
Step  Tag      Prob.  Tags available  Tags to be added  Tags added (cumulative)
2.1   actor    0.715  15              5                 5 (20)
-     keyword  0      0               0                 0 (2)
2.2   trivia   0.285  9               2                 2 (8)
Total added after Round 2: 30
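The round-based budget distribution above can be sketched as follows. The exact rounding and tie-breaking rules are assumptions, chosen so that the toy run reproduces the slide's actor/keyword/trivia walkthrough.

```python
def allocate(budget, probs, available):
    """Distribute a tag budget over tag types in rounds.

    Each round hands every still-available tag type round(prob * budget)
    slots, capped by availability; probabilities are then re-normalized
    over the types that still have tags left, and the next round runs on
    the remaining budget.
    """
    taken = {t: 0 for t in probs}
    remaining = dict(available)
    while budget > 0:
        live = [t for t in probs if remaining[t] > 0 and probs[t] > 0]
        if not live:
            break  # nothing left to add
        mass = sum(probs[t] for t in live)  # re-normalization constant
        before = budget
        for t in live:
            want = round(probs[t] / mass * before)
            got = min(want, remaining[t], budget)
            taken[t] += got
            remaining[t] -= got
            budget -= got
        if budget == before:  # rounding gave everyone 0: avoid looping forever
            top = max(live, key=lambda t: probs[t])
            taken[top] += 1
            remaining[top] -= 1
            budget -= 1
    return taken

probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
available = {"actor": 30, "keyword": 2, "trivia": 15}
print(allocate(30, probs, available))  # actor 20, keyword 2, trivia 8
```

Round 1 takes 15 actors, 2 keywords (all that exist), and 6 trivia; the leftover budget of 7 is then split between actor and trivia at re-normalized probabilities 0.5/0.7 and 0.2/0.7, giving 5 more actors and 2 more trivia, for 30 tags in total.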
User Evaluation

Dataset  No. of files  No. of unique tags  No. of documents used for evaluation
Movie    200,000       39                  8
People   150,000       11                  4

Evaluated configurations:
Dataset  Size  alpha
Movie    5     1, 0.8
Movie    10    1, 0.8, 0.6
Movie    20    1, 0.8, 0.6
People   5     1, 0.6
People   10    1, 0.6
Total summaries: 64 + 16 = 80

The 80 automatically generated summaries were mixed with 32 human-generated summaries.
Summaries were graded on a scale of 1-7, where 1 = extremely bad and 7 = perfect.
There were six different evaluators; each summary was evaluated by at least three of them.
User Evaluation

Tabulation of average and above-average grades (4-7). Note: grades are shown only if at least two evaluators agreed.

Movie dataset:
Size    alpha 1.0      alpha 0.8      alpha 0.6     Total (across alpha)
5       8/8 (100%)     5/8 (62.5%)    -             13/16 (81.25%)
10      8/8 (100%)     7/8 (87.5%)    1/8 (12.5%)   16/24 (66.6%)
20      7/8 (87.5%)    7/8 (87.5%)    4/8 (50%)     18/24 (75%)
Total   23/24 (95.8%)  19/24 (79.1%)  5/16 (31.2%)  47/64 (73.4%)

People dataset:
Size    alpha 1.0     alpha 0.6    Total (across alpha)
5       3/4 (75%)     1/4 (25%)    4/8 (50%)
10      4/4 (100%)    4/4 (100%)   8/8 (100%)
Total   7/8 (87.5%)   5/8 (62.5%)  12/16 (75%)
Xoom
A tool for exploring and summarizing XML documents

Exploration Mode
Xoom
Summarization Mode - Titanic.xml
Conclusion
- A fully automated XML summary generator
- Ranking of tags and text based on the ranking model
- Generation of the summary from ranked tags and text within a memory budget
- Xoom, a tool for exploring and summarizing XML documents
- User evaluation
Publications

Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of the Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.

A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.

User Evaluation of Summaries: http://www.mpi-inf.mpg.de/~ramanath/Summarization/
Thanks!
Appendix

Informativeness
Coverage
Why not tag-text pairs?
Ocean’s Eleven.xml - Summaries
Titanic.xml on OST Summarizer Gern Drowning man Martin, Johnny (I) Rescue boat crewman Lynch, Don (II) Frederick Spedden Cameron, James (I) Cameo appearance (steerage dancer) Cragnotti, Chris Victor Giglio Kenny, Tony (I) Deckhand Campolo, Bruno Second-class man Abercrombie, Ian adr loop group uncredited Allen, Melinda assistant: James Cameron uncredited Altman, John (I) historical music advisor Altman, John (I) music arranger: period music Amorelli, Mike rigging gaffer Amorelli, Paul rigging best boy electric Anaya, Daniel grip Andrade, Maria Louise costumer Baker, Brett photo double: Leonardo DiCaprio Arvizu, Ricardo grip Bailes, Tim marine consultant Arneson, Charlie aquatic researcher Arneson, Charlie aquatic supervisor Arnold, Amy key set costumer: women Atkinson, Lisa (I) pre-production consultant Barius, Claudette additional still photographer: pre-production uncredited Baker, Jeanie costumer Barton, Roger associate editor Baker, Tom (VI) electrician Bass, Andy (I) assistant music engineer Barber, Jamie (I) first assistant camera: Halifax Baylon, Hugo location assistant Bee, Guy Norman camera operator Benarroch, Ariel first assistant camera: second unit uncredited Bendt, Tony company grip Boccoli, Daniel apprentice editor Botham, Buddy generator operator Bonner, Kit naval consultant Blevins, Cha costumer as Deborah 'Cha' Blevins Bloom, Kirk second assistant camera Bolton, Paul electrician Bornstein, Bob music preparation Bozeman, Marsha costumer Broberg, David first assistant film editor Brady, Kenneth Patrick production assistant Bruno, Keri production assistant Bryan, Mitch (III) assistant video assist operator Bryce, Malcolm lamp operator Burdick, Geoff production associate Buckley, John (III) gaffer Cameron, James (I) director of photography: Titanic deep dive camera Cameron, James (I) special camera equipment designer Cameron, Michael (II) special deep ocean camera system Byall, Bruce grip Byron, Carol Sue additional production accountant uncredited 
Canedo, Luis rigging electrician as Jose
User Evaluation of Summaries - IMDB Dataset Files Used

Movie: American Beauty, Ocean's Eleven, Kill Bill Part II, Saving Private Ryan, The Last Samurai, The Usual Suspects, Titanic, A Space Odyssey
People: Brad Pitt, Matt Damon, Ben Affleck, Leonardo DiCaprio
User Evaluation of Summaries – IMDB Dataset
Xoom