Published by Ronald Ellis. Modified over 9 years ago.
1
Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty
Gabrilovich et al., WWW 2004
2
Main Contents
Identify the novelty of news stories given the preceding news a user has read
Newsjunkie: a set of algorithms for different (but related) tasks
Technique: text collection comparison
Tasks:
– Ranking news by novelty
– Personalized news updates
– Characterization of relevance types of articles
Evaluation or examples
3
Review: Text Comparison
Syntactic differences between Web pages
– e.g., AT&T Internet Difference Engine
Characteristic words
– e.g., genre classification
Language models for entire collections
– e.g., corpus linguistics
Comparing one set of documents to another
– e.g., MMR (Maximal Marginal Relevance)
– Newsjunkie
4
Research Problems
Focus on temporal aspects of content difference
– automatically assess the novelty over time of news articles coming from live newsfeeds
Look for documents most dissimilar from documents reviewed earlier
– limitation: outputs entire documents rather than the novel parts of multiple documents
=> identifying novel parts is much harder: it would additionally require IE and summarization
5
Difference of Text Content
KL divergence
Density of new named entities
– assumption: novelty is often conveyed by introducing new named entities
? Is normalization reasonable? What we need is new information, regardless of how long the document is.
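The two distance measures named on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the additive smoothing constant `alpha`, and the choice to count distinct new entities per token are all assumptions made for the example.

```python
import math
from collections import Counter


def kl_divergence(doc_tokens, background_tokens, alpha=0.01):
    """KL(doc || background) over additively smoothed unigram distributions.

    alpha is a hypothetical smoothing constant; it keeps every probability
    non-zero so the logarithm is always defined.
    """
    vocab = set(doc_tokens) | set(background_tokens)
    doc_counts = Counter(doc_tokens)
    bg_counts = Counter(background_tokens)
    doc_total = len(doc_tokens) + alpha * len(vocab)
    bg_total = len(background_tokens) + alpha * len(vocab)
    score = 0.0
    for word in vocab:
        p = (doc_counts[word] + alpha) / doc_total
        q = (bg_counts[word] + alpha) / bg_total
        score += p * math.log(p / q)
    return score


def new_entity_density(doc_entities, background_entities, doc_length):
    """Distinct named entities absent from the background, per document token.

    Dividing by doc_length is the normalization the slide questions.
    """
    if doc_length == 0:
        return 0.0
    new_entities = set(doc_entities) - set(background_entities)
    return len(new_entities) / doc_length
```

Named entity extraction itself is left out; `doc_entities` is assumed to come from whatever entity tagger is available.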
6
Task 1: news ranking
7
Evaluation 1
Users evaluated 3 distance metrics on 12 topics
– KL divergence; density of NE; chronological order
Each metric produced a set of 3 novel documents
Users judged which set was the most novel
Statistical significance tests on mean ranks
– KL and NE are superior to chronological order
– No significant difference between KL and NE
? The order of the 3 articles is not considered, although the task is ranking!
? The statistical tests cover only the mean; what about the variance?
8
Task 2: personalized news update
Task 2.1: single daily update
– articles from the preceding day serve as background
– the user specifies a novelty threshold
Future work: consider more previous articles, with weights decaying with age
No evaluation in this part
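The daily update can be pictured as a simple threshold filter over novelty scores. The sketch below is only an illustration: `daily_update` and `unseen_fraction` are hypothetical names, and `unseen_fraction` is a deliberately crude stand-in for a real distance metric such as KL divergence.

```python
def unseen_fraction(doc_tokens, background_tokens):
    """Stand-in distance: share of the document's tokens absent from the background."""
    if not doc_tokens:
        return 0.0
    background = set(background_tokens)
    return sum(1 for t in doc_tokens if t not in background) / len(doc_tokens)


def daily_update(todays_articles, background_tokens, threshold,
                 distance=unseen_fraction):
    """Return titles of today's articles whose novelty exceeds the user's threshold.

    todays_articles: list of (title, tokens) pairs.
    background_tokens: tokens from the preceding day's articles.
    Results are ordered from most to least novel.
    """
    selected = []
    for title, tokens in todays_articles:
        score = distance(tokens, background_tokens)
        if score > threshold:
            selected.append((score, title))
    selected.sort(reverse=True)
    return [title for _, title in selected]
```

Raising `threshold` yields fewer, more novel articles; lowering it lets through articles that mostly repeat the background.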
9
Task 2.2: breaking news report
Detect new information about a story
Preceding articles within a sliding window serve as background
– empirically, a window size of 40 articles
Filter out delayed reports and recaps
– given the nature of news reports, these appear as narrow spikes in the distance graph
– a median filter removes narrow spikes
– empirically, a filter width of 5
? How should these parameters be set?
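A rough sketch of the two steps above: scoring each article against a sliding window of its predecessors, then smoothing the score series with a median filter. The function names are hypothetical, the `distance` callable is assumed to be supplied by the caller, and the shrinking window at the edges of the series is a choice made for this example; only the window size of 40 and the filter width of 5 come from the slide.

```python
from statistics import median


def windowed_scores(articles, distance, window=40):
    """Score each article against the tokens of the preceding `window` articles.

    articles: list of token lists in arrival order.
    distance: callable (doc_tokens, background_tokens) -> float.
    """
    scores = []
    for i, tokens in enumerate(articles):
        preceding = articles[max(0, i - window):i]
        background = [t for prev in preceding for t in prev]
        scores.append(distance(tokens, background))
    return scores


def median_filter(scores, width=5):
    """Replace each score by the median of the `width` scores centred on it.

    Spikes narrower than about half the width vanish; sustained rises survive.
    Near the edges the window is truncated (an assumption of this sketch).
    """
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        neighbourhood = scores[max(0, i - half):i + half + 1]
        smoothed.append(median(neighbourhood))
    return smoothed
```

In a series such as `[1, 1, 1, 9, 1, 1, 1, 5, 5, 5, 5, 5]`, the single spike of 9 is suppressed while the sustained jump to 5 remains, which is the behaviour wanted for separating recaps and delayed reports from genuine developments.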
10
Task 2.2: example
11
Task 3: relevance type of articles
Four types of relevance to the background
– Recap: repeats old material
– Elaboration: adds new information
– Offshoot: mainly about another topic
– Irrelevant: a totally different topic
Identify them using intra-document dynamics
12
Task 3: intra-document dynamics
Estimate the relevance of different parts within a document
Sliding window with a fixed size
Compare the content within the window to the background
Plot the distance scores
Identify different patterns
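The sliding-window procedure above might look like the following sketch. The function name, the default window size, and the unseen-token fraction used as the per-window distance are all assumptions for illustration; the slides do not specify them.

```python
def intra_document_scores(doc_tokens, background_tokens, window=20, step=1):
    """Slide a fixed-size window over the document and score each position.

    Each score is the fraction of tokens in the window that never occur in
    the background (a simple stand-in for a real distance metric).
    Documents shorter than `window` produce a single score.
    """
    if not doc_tokens:
        return []
    background = set(background_tokens)
    scores = []
    last_start = max(0, len(doc_tokens) - window)
    for start in range(0, last_start + 1, step):
        chunk = doc_tokens[start:start + window]
        unseen = sum(1 for t in chunk if t not in background)
        scores.append(unseen / len(chunk))
    return scores
```

Plotting the returned list against window position gives the kind of curve whose shape is then inspected for the different relevance patterns.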
13
What will the graph of an irrelevant article look like?
-- Higher absolute scores, but a small dynamic range
14
Contributions
Novel novelty metric
– density of named entities
Evaluation by users
Breaking news detection
– novel adoption of the median filter
Characterization of article types
– intra-story novelty patterns
15
Limitations
Generalization of the named-entity metric:
– works well in the news domain, but what about others?
User evaluation: too coarse
– does not consider the order of articles
– used old news that users had seen before the tests
Claimed to be “personalized”, but only provides flexibility in the threshold and, possibly, in article relevance type selection
It would be better if it could identify the novel parts
– or maybe not: returning whole articles keeps the integrity of a piece of news
16
Thank you!