Summarization of XML Documents K Sarath Kumar. Outline I.Motivation II.System for XML Summarization III.Ranking Model and Summary Generation IV.Example.

Summarization of XML Documents K Sarath Kumar

Outline
I. Motivation
II. System for XML Summarization
III. Ranking Model and Summary Generation
IV. Example Summaries
V. Conclusion and Future Work

Motivation
[Figure: an XML document collection (e.g., IMDB) and a single XML document]
Types of XML document summaries:
1) Generic summary: summarizes the entire contents of the document.
2) Query-biased summary: summarizes those parts of the document that are relevant to the user's query.

Aims
We aim at summaries that are:
- Generated automatically
- Highly constrained by size
- Highly informative
- High in coverage
Challenges
- Structure is as important as text
- Varying text length

System for XML Summarization
[Architecture diagram. Components: Info Unit Generator; Ranking Unit (Tag Ranker, Text Ranker); Summary Generator. Labels: XML Doc, Corpus Statistics, Tag Units, Text Units, Ranked Tag Units, Ranked Text Units, Summary Size, Summary.]

Information Units of an XML Document
Tag
- Regarded as metadata
- Can be highly redundant
- Can be encoded into a schema or DTD
Text
- Instance for the tag
- Much less redundant
- Varies in size

Ranking Unit
I. Tag Ranking
Typicality: How salient is the tag in the corpus?
- Typical tags define the context of the document
- They occur regularly in most or all of the documents
- Quantified by the fraction of documents in which the tag occurs (df)
Specialty: Does the tag occur more or less frequently in this document than usual?
- Special tags denote a special aspect of the current document
- They occur many more or many fewer times in the current document than usual
- Quantified by the deviation from the average number of occurrences per document
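The two tag measures can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the representation of a document as a flat list of tag names, and the choice of relative absolute deviation for specialty are all assumptions made for illustration. The slides only say that typicality is the document-frequency fraction and that specialty is a deviation from the average number of occurrences per document.

```python
from collections import Counter


def tag_statistics(corpus):
    """Return (typicality, avg_count) for every tag in the corpus.

    corpus: list of documents, each a list of tag names
            (one entry per tag occurrence). Hypothetical input format.
    """
    n_docs = len(corpus)
    doc_freq = Counter()   # number of documents containing the tag
    total = Counter()      # total occurrences over the whole corpus
    for doc in corpus:
        counts = Counter(doc)
        doc_freq.update(counts.keys())
        total.update(counts)
    typicality = {tag: doc_freq[tag] / n_docs for tag in doc_freq}
    avg_count = {tag: total[tag] / n_docs for tag in total}
    return typicality, avg_count


def specialty(doc, tag, avg_count):
    """Relative deviation of the tag's count in `doc` from the corpus average.

    Assumption: the slides do not give the exact formula; absolute deviation
    divided by the average is used here as one plausible choice.
    """
    observed = Counter(doc)[tag]
    return abs(observed - avg_count[tag]) / avg_count[tag]


# Toy corpus (invented for illustration only).
corpus = [
    ["title", "actor", "actor", "trivia"],
    ["title", "actor"],
    ["title", "actor", "actor", "actor", "actor", "goof"],
]
typicality, avg_count = tag_statistics(corpus)
actor_specialty = specialty(corpus[2], "actor", avg_count)
```

In this toy corpus, `title` and `actor` occur in every document and are therefore fully typical, while the third document uses `actor` more often than average and so gets a non-zero specialty for that tag.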

II. Text Ranking
Two categories of text:
1) Entities
2) Regular text

Contexts: tag context, document context, corpus context
Ranking is done based on the context of occurrence.
- No redundancy in tag context (e.g., actor names, genre)
- Redundancy in tag context (e.g., plots, goofs, trivia items)

Correlated Tags and Text
Related tag units are often siblings of each other, e.g., Actor and Role.
Inclusion Principle
Case 1: [formula not preserved in the transcript]
Case 2: [formula not preserved in the transcript]
Let [...] and [...]

Generation of Summary
Consider the following tag rank table:

Tag      Prob.
Actor    0.5
Keyword  0.3
Trivia   0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags and 6 trivia tags would be required.

Tag      Required no. of tags  Available no. of tags
Actor    15                    30
Keyword  9                     2
Trivia   6                     15

Distribute the remaining "tag budget" by re-normalizing the distribution of available tags.

Generating the summary with 30 tags

Step  Tag      Prob.  No. of tags available  No. of tags to be added  No. of tags added in the round
Round 1
1.1   actor    0.5    30                     15                       15
1.2   keyword  0.3    2                      9                        2
1.3   trivia   0.2    15                     6                        6
Total                                                                 23
Round 2
2.1   actor    0.715  15                     5                        5 (20)
-     keyword  0      0                      0                        0 (2)
2.2   trivia   0.285  15                     2                        2 (8)
Total                                                                 30
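The round-by-round allocation in this table can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function name `allocate_tag_budget`, the use of exact fractions, and the largest-remainder rounding of per-round quotas are assumptions. The slides only state that the remaining tag budget is distributed by re-normalizing the distribution over the tags that are still available.

```python
from fractions import Fraction


def allocate_tag_budget(probs, available, budget):
    """Distribute `budget` tag slots over tags in proportion to `probs`.

    probs:     dict tag -> probability from the tag rank table
    available: dict tag -> number of such tags present in the document
    Returns a dict tag -> number of tags placed in the summary.
    """
    exact = {t: Fraction(str(p)) for t, p in probs.items()}
    remaining = dict(available)
    allocated = {t: 0 for t in probs}
    left = budget
    while left > 0:
        # Only tags that still have unused instances take part in the round.
        active = {t: p for t, p in exact.items() if remaining[t] > 0 and p > 0}
        if not active:
            break  # the document has fewer tags than the budget allows
        total = sum(active.values())
        # Re-normalize and compute this round's quotas (largest-remainder
        # rounding, an assumption, so that quotas sum exactly to `left`).
        share = {t: left * p / total for t, p in active.items()}
        quota = {t: int(s) for t, s in share.items()}
        spare = left - sum(quota.values())
        by_remainder = sorted(active, key=lambda t: share[t] - quota[t], reverse=True)
        for t in by_remainder[:spare]:
            quota[t] += 1
        # Add as many tags as the quota and the document allow.
        for t, q in quota.items():
            take = min(q, remaining[t])
            allocated[t] += take
            remaining[t] -= take
            left -= take
    return allocated


result = allocate_tag_budget(
    probs={"actor": 0.5, "keyword": 0.3, "trivia": 0.2},
    available={"actor": 30, "keyword": 2, "trivia": 15},
    budget=30,
)
print(result)
```

With the slide's inputs, the first round adds 15, 2 and 6 tags (23 in total) and the second round splits the remaining 7 slots between actor and trivia, which agrees with the running totals shown in parentheses in the table.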

Few Example Summaries Titanic.xml - Summaries

Conclusion
- A fully automated XML summary generator
- Ranking of tags and text based on the ranking model
- Generation of the summary from ranked tags and text within a memory budget
- User evaluation is underway
Future Work
- Rewriting the structure of the XML documents during summarization
- Possible use of text summarizers for long text
- Query-biased XML summary generation

Thanks!

Appendix Informativeness

Coverage

Ranking Model
I. TAG RANKER
Typicality: How typical is the tag in the corpus?
Mixture model of typicality and specialty

Specialty : How unusually frequent/infrequent is the tag in the current document compared to an average document of the corpus?

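Text with redundancy in tag context
- Sort terms by frequency and take the top 'm' terms as the centroid query
- Relevance: [formula not preserved in the transcript]
- Similarity: [formula not preserved in the transcript]
- Calculated using maximal marginal relevance (MMR)
- Finally, [formula not preserved in the transcript]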
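Because the slide's formulas did not survive, the following Python sketch uses the standard textbook form of MMR rather than the authors' exact definitions. Everything in it is an assumption for illustration: the bag-of-words vectors, cosine similarity for both relevance and similarity, the trade-off parameter `lam`, and the function names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rank(texts, m=5, lam=0.5, k=None):
    """Rank text units that share a tag context (e.g. all trivia items).

    The centroid query is built from the top `m` most frequent terms.
    Each step picks the unit maximizing
        lam * relevance - (1 - lam) * max similarity to already chosen units.
    Returns the indices of the selected texts in rank order.
    """
    vectors = [Counter(text.lower().split()) for text in texts]
    term_totals = Counter()
    for vec in vectors:
        term_totals.update(vec)
    centroid = Counter(dict(term_totals.most_common(m)))
    relevance = [cosine(vec, centroid) for vec in vectors]

    selected = []
    candidates = list(range(len(texts)))
    limit = len(texts) if k is None else k
    while candidates and len(selected) < limit:
        def score(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Invented example texts; the second is an exact duplicate of the first.
texts = [
    "the ship sinks after hitting an iceberg",
    "the ship sinks after hitting an iceberg",
    "a necklace is lost at sea",
]
ranking = mmr_rank(texts, m=5, lam=0.5)
```

The design intent is that a unit which repeats an already selected one is pushed down the ranking even when it is highly relevant to the centroid query, which is the behavior wanted for redundant contexts such as plots, goofs and trivia items.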

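Text without redundancy in tag context
- Redundancy at tag level: [formula not preserved in the transcript]
- No redundancy at tag level: [formula not preserved in the transcript]
- A parameter (symbol not preserved) is set empirically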

Relative Count Matrix
A relative count matrix is constructed.
Given two tags Ti and Tj, where Ti is ranked higher, the relative importance of Tj with respect to Ti is calculated by dividing both P(Ti|D) and P(Tj|D) by P(Tj|D). The resulting ratio P(Ti|D)/P(Tj|D) shows how many Ti tags are worth one Tj tag.
Tj is considered only after P(Ti|D)/P(Tj|D) Ti tags have been considered.
Extending this concept, a matrix of relative counts can be formed.

Tag      Prob.
actor    0.5
keyword  0.3
trivia   0.2

         actor  keyword  trivia  Row-wise total
actor    1      -        -       1
keyword  2      1        -       3
trivia   3      2        1       6
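The matrix entries can be reproduced with a short Python sketch. The function name and the use of the ceiling function are assumptions: the slides do not state how fractional ratios are rounded, but rounding up is consistent with the values shown (0.5/0.3 gives 2, 0.5/0.2 gives 3, 0.3/0.2 gives 2).

```python
from fractions import Fraction
from math import ceil


def relative_count_matrix(probs):
    """Build the lower-triangular relative count matrix.

    probs: dict tag -> P(tag | D)
    Returns (tags, matrix) where `tags` is sorted by decreasing probability
    and matrix[tj][ti] is the number of Ti tags considered before one Tj tag,
    for every Ti ranked at or above Tj.
    """
    tags = sorted(probs, key=probs.get, reverse=True)
    exact = {t: Fraction(str(probs[t])) for t in tags}  # avoid float noise
    matrix = {}
    for row_index, tj in enumerate(tags):
        # Assumption: ratios are rounded up to whole tag counts.
        matrix[tj] = {ti: ceil(exact[ti] / exact[tj]) for ti in tags[: row_index + 1]}
    return tags, matrix


tags, matrix = relative_count_matrix({"actor": 0.5, "keyword": 0.3, "trivia": 0.2})
row_totals = {tag: sum(row.values()) for tag, row in matrix.items()}
```

Under these assumptions the row-wise totals come out as 1, 3 and 6, as in the table.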

Ocean’s Eleven.xml - Summaries

Generating the summary with 30 tags

Step  Tag      Prob.  No. of tags available  No. of tags to be added  No. of tags added in the round
Round 1
1.1   actor    0.5    30                     15                       15
1.2   keyword  0.3    2                      9                        2
1.3   trivia   0.2    15                     6                        -
Total                                                                 17
Round 2
2.1   actor    0.715  15                     9                        9 (24)
-     keyword  0      0                      0                        0 (2)
2.2   trivia   0.285  15                     4                        4 (4)
Total                                                                 30
