Summarization of XML Documents Kondreddi Sarath Kumar.


1 Summarization of XML Documents Kondreddi Sarath Kumar

2 Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. User Evaluation
V. Xoom tool and few example summaries
VI. Conclusion

3 Motivation
Diagram labels: XML Document Collection (e.g. IMDB), XML Document
Types of XML document summaries:
1) Generic summary: summarizes the entire contents of the document.
2) Query-biased summary: summarizes those parts of the document that are relevant to the user's query.

4 Aims
We aim at summaries which are:
- Generated automatically
- Highly constrained by size
- Highly informative
- High in coverage
Challenges
- Structure is as important as text
- Varying text length

7 System for XML Summarization
Components: Info Unit Generator; Ranking Unit (Tag Ranker, Text Ranker); Summary Generator
Inputs: XML Doc, Corpus Statistics, Summary Size
Intermediate data: Tag Units, Text Units, Ranked Tag units, Ranked Text units
Output: Summary

8 Information Units of an XML Document
Tag
- Regarded as metadata
- Can be highly redundant
Text
- Instance for the tag
- Much less redundant
- Have different sizes
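The split into tag units and text units can be illustrated with a minimal sketch. The slides do not say how the Info Unit Generator is implemented, so the function name `information_units`, the choice to pair each text with its enclosing tag, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample document; the element names are illustrative only.
SAMPLE = """<movie>
  <title>Titanic</title>
  <actor>DiCaprio, Leonardo</actor>
  <actor>Winslet, Kate</actor>
  <genre>Drama</genre>
</movie>"""


def information_units(xml_string):
    """Return (tag_units, text_units) for one XML document.

    tag_units  -- one entry per element, i.e. the metadata (may repeat)
    text_units -- (tag, text) pairs, i.e. the instances found under each tag
    """
    root = ET.fromstring(xml_string)
    tag_units = []
    text_units = []
    for element in root.iter():
        tag_units.append(element.tag)
        text = (element.text or "").strip()
        if text:  # skip elements that only contain whitespace or children
            text_units.append((element.tag, text))
    return tag_units, text_units


tag_units, text_units = information_units(SAMPLE)
tag_counts = Counter(tag_units)  # shows how redundant each tag is
print(tag_counts)
print(text_units)
```

In this sample the `actor` tag repeats while each text under it is different, which matches the observation that tags can be highly redundant and text much less so.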

11 Ranking Unit
I. Tag Ranking
Typicality: How salient is the tag in the corpus?
- Typical tags define the context of the document
- They occur regularly in most or all of the documents
- Quantified by the fraction of documents in which the tag occurs (df)
Specialty: Does the tag occur more or less frequently in this document?
- Special tags denote a special aspect of the current document
- They occur more or fewer times in the current document than usual
- Quantified by the deviation from the average number of occurrences per document
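The two tag measures can be sketched as follows. This is a minimal illustration, not the authors' exact formulas: the slides only say "fraction of documents" and "deviation from average", so the use of an absolute difference, the averaging over all documents in the corpus, and the toy corpus are assumptions.

```python
from collections import Counter

# Toy corpus: one Counter of tag occurrences per document (made-up numbers).
CORPUS = [
    Counter({"actor": 10, "genre": 2}),
    Counter({"actor": 12, "genre": 1}),
    Counter({"actor": 8, "genre": 3}),
    Counter({"actor": 30, "genre": 2, "trivia": 5}),
]


def typicality(tag, corpus):
    """Fraction of documents in which the tag occurs (document frequency)."""
    containing = sum(1 for doc in corpus if doc.get(tag, 0) > 0)
    return containing / len(corpus)


def specialty(tag, doc, corpus):
    """Absolute deviation of the tag's count in `doc` from its average
    count per document (assumed form of the deviation)."""
    average = sum(d.get(tag, 0) for d in corpus) / len(corpus)
    return abs(doc.get(tag, 0) - average)


current = CORPUS[3]
print(typicality("actor", CORPUS))          # tag present in every document
print(typicality("trivia", CORPUS))         # tag present in one of four
print(specialty("actor", current, CORPUS))  # far more actors than usual
```

Here `actor` is typical because it appears in every document, and it is also special in the last document because that document has twice the average number of actors.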

13 II. Text Ranking
Two categories of text:
1) Entities
2) Regular text

14 Ranking is done based on context of occurrence: tag context, document context, corpus context.
- No redundancy in tag context (e.g. actor names, genre)
- Redundancy in tag context (e.g. plots, goofs, trivia items)

17 Correlated tags and text
Related tag units are often siblings of each other, e.g. Actor and Role.
Inclusion Principle: Case 1 and Case 2 (the case definitions are not preserved in this transcript).

18 Generation of Summary
Consider the following tag rank table:
Tag      Prob.
Actor    0.5
Keyword  0.3
Trivia   0.2
To generate a summary with 30 tags, 15 actor tags, 9 keyword tags and 6 trivia tags would be required.
Tag      Required no. of tags  Available no. of tags
Actor    15                    30
Keyword  9                     2
Trivia   6                     15
Only 2 keyword tags are available, so 15 + 2 + 6 = 23 tags can be added and 7 slots of the budget remain.
Distribute the remaining "tag budget" by re-normalizing the distribution of available tags.

20 Generating the summary with 30 tags
Step    Tag      Prob.  Tags available  Tags to be added  Tags added in the round
Round 1
1.1     actor    0.5    30              15                15
1.2     keyword  0.3    2               9                 2
1.3     trivia   0.2    15              6                 6
Total                                                     23
Round 2
2.1     actor    0.715  15              5                 5 (20)
-       keyword  0      0               0                 0 (2)
2.2     trivia   0.285  15              2                 2 (8)
Total                                                     30
Values in parentheses are the cumulative number of tags added so far.
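The rounds in the table above can be sketched in a few lines. This is an illustrative reading of the slides rather than the authors' implementation: the function name `allocate_tag_budget`, the use of `round()` to get whole tags, and the tie-breaking fallback when rounding yields zero are assumptions.

```python
def allocate_tag_budget(probs, available, budget):
    """Distribute `budget` tag slots over tags in proportion to `probs`,
    never exceeding `available`; leftover budget is re-distributed in
    further rounds after re-normalizing over tags that still have stock."""
    allocated = {tag: 0 for tag in probs}
    remaining = budget
    while remaining > 0:
        # Tags that still have unused instances take part in this round.
        active = {t: p for t, p in probs.items()
                  if p > 0 and available[t] - allocated[t] > 0}
        if not active:
            break  # nothing left to add anywhere
        total_p = sum(active.values())
        added = 0
        for tag, p in active.items():
            wanted = round(remaining * p / total_p)  # re-normalized share
            left = available[tag] - allocated[tag]
            take = min(wanted, left, remaining - added)
            allocated[tag] += take
            added += take
        if added == 0:
            # Assumed fallback: rounding gave every tag zero slots,
            # so hand one slot to the most probable active tag.
            best = max(active, key=active.get)
            allocated[best] += 1
            added = 1
        remaining -= added
    return allocated


probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
available = {"actor": 30, "keyword": 2, "trivia": 15}
result = allocate_tag_budget(probs, available, budget=30)
print(result)
```

With the numbers from the slide, the first round adds 23 tags and the second round splits the remaining 7 between actor and trivia in the ratio 0.5 : 0.2, which reproduces the cumulative counts in the table.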

21 User Evaluation
Dataset  No. of files  No. of unique tags  No. of documents used for evaluation
Movie    200,000       39                  8
People   150,000       11                  4
Evaluated settings:
Dataset  Summary size  alpha
Movie    5             1, 0.8
Movie    10, 20        1, 0.8, 0.6
People   5, 10         1, 0.6
Total: 64 + 16 = 80 summaries
- Automatically generated summaries (80) were mixed with human-generated summaries (32)
- Summaries were graded on a scale of 1-7, where 1 is extremely bad and 7 is perfect
- Six different evaluators; each summary was evaluated by at least three
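The total of 80 automatically generated summaries can be checked with a short calculation. The pairing of summary sizes with alpha values below is an assumption read off the flattened table on this slide; the document counts come from the same slide.

```python
# Number of evaluation documents per dataset (from the slide).
DOCS = {"Movie": 8, "People": 4}

# Assumed pairing of summary size -> alpha values used at that size.
SETTINGS = {
    "Movie": {5: [1.0, 0.8], 10: [1.0, 0.8, 0.6], 20: [1.0, 0.8, 0.6]},
    "People": {5: [1.0, 0.6], 10: [1.0, 0.6]},
}


def summaries_per_dataset(docs, settings):
    """One summary per (document, size, alpha) combination."""
    return {
        name: docs[name] * sum(len(alphas) for alphas in settings[name].values())
        for name in docs
    }


counts = summaries_per_dataset(DOCS, SETTINGS)
total = sum(counts.values())
print(counts, total)
```

Under this reading, Movie contributes 8 documents times 8 settings and People contributes 4 documents times 4 settings, which agrees with the 64 + 16 = 80 stated on the slide.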

22 User Evaluation
Tabulation of average and above-average grades (4-7)
Note: A grade is shown only if at least 2 evaluators agreed on it.
Movie dataset:
Size                  alpha 1.0     alpha 0.8     alpha 0.6     Total (across alpha)
5                     8/8 (100%)    5/8 (62.5%)   -             13/16 (81.25%)
10                    8/8 (100%)    7/8 (87.5%)   1/8 (12.5%)   16/24 (66.6%)
20                    7/8 (87.5%)   7/8 (87.5%)   4/8 (50%)     18/24 (75%)
Total (across sizes)  23/24 (95.8%) 19/24 (79.1%) 5/16 (31.2%)  47/64 (73.4%)
People dataset:
Size                  alpha 1.0     alpha 0.8     alpha 0.6     Total (across alpha)
5                     3/4 (75%)     -             1/4 (25%)     4/8 (50%)
10                    4/4 (100%)    -             4/4 (100%)    8/8 (100%)
Total (across sizes)  7/8 (87.5%)   -             5/8 (62.5%)   12/16 (75%)

23 Xoom A tool for exploring and summarizing XML documents Exploration Mode

24 Xoom Summarization Mode - Titanic.xml

25

26 Conclusion
- A fully automated XML summary generator
- Ranking of tags and text based on the ranking model
- Generation of summary from ranked tags & text within memory budget
- Xoom: a tool for exploring and summarizing XML documents
- User Evaluation

27 Publications
- Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.
- A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.
User Evaluation of Summaries: http://www.mpi-inf.mpg.de/~ramanath/Summarization/

28 Thanks!

29 Appendix Informativeness

30 Coverage

31 Why not tag-text pairs?

32 Ocean’s Eleven.xml - Summaries

33 Titanic.xml on OST Summarizer Gern Drowning man Martin, Johnny (I) Rescue boat crewman Lynch, Don (II) Frederick Spedden Cameron, James (I) Cameo appearance (steerage dancer) Cragnotti, Chris Victor Giglio Kenny, Tony (I) Deckhand Campolo, Bruno Second-class man Abercrombie, Ian adr loop group uncredited Allen, Melinda assistant: James Cameron uncredited Altman, John (I) historical music advisor Altman, John (I) music arranger: period music Amorelli, Mike rigging gaffer Amorelli, Paul rigging best boy electric Anaya, Daniel grip Andrade, Maria Louise costumer Baker, Brett photo double: Leonardo DiCaprio Arvizu, Ricardo grip Bailes, Tim marine consultant Arneson, Charlie aquatic researcher Arneson, Charlie aquatic supervisor Arnold, Amy key set costumer: women Atkinson, Lisa (I) pre-production consultant Barius, Claudette additional still photographer: pre-production uncredited Baker, Jeanie costumer Barton, Roger associate editor Baker, Tom (VI) electrician Bass, Andy (I) assistant music engineer Barber, Jamie (I) first assistant camera: Halifax Baylon, Hugo location assistant Bee, Guy Norman camera operator Benarroch, Ariel first assistant camera: second unit uncredited Bendt, Tony company grip Boccoli, Daniel apprentice editor Botham, Buddy generator operator Bonner, Kit naval consultant Blevins, Cha costumer as Deborah 'Cha' Blevins Bloom, Kirk second assistant camera Bolton, Paul electrician Bornstein, Bob music preparation Bozeman, Marsha costumer Broberg, David first assistant film editor Brady, Kenneth Patrick production assistant Bruno, Keri production assistant Bryan, Mitch (III) assistant video assist operator Bryce, Malcolm lamp operator Burdick, Geoff production associate Buckley, John (III) gaffer Cameron, James (I) director of photography: Titanic deep dive camera Cameron, James (I) special camera equipment designer Cameron, Michael (II) special deep ocean camera system Byall, Bruce grip Byron, Carol Sue additional production accountant uncredited 
Canedo, Luis rigging electrician as Jose

34 User Evaluation of Summaries – IMDB Dataset: Files used
Dataset  Filename
Movie    American Beauty; Ocean’s Eleven; Kill Bill Part II; Saving Private Ryan; The Last Samurai; The Usual Suspects; Titanic; A Space Odyssey
People   Brad Pitt; Matt Damon; Ben Affleck; Leonardo DiCaprio

35 User Evaluation of Summaries – IMDB Dataset

36 Xoom

