PNFS: PERSONALIZED NEWS FILTERING & SUMMARIZATION ON THE WEB
  Authors: Xindong Wu, Fei Xie, Gongqing Wu, Wei Ding
  23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2011
  Presenter: Rashida Hasan
  April 06, 2017
  Center for Advanced Computer Studies, University of Louisiana at Lafayette
Overview
  Motivation
  Objectives
  Paper Overview
  Personalized News Filtering & Summarization (PNFS)
  Experiments & Results
  Related Work
  Summary
  Exam Questions
Motivation
  The indexed Web contains at least 5.02 billion pages (as of Tuesday, 08 September 2015)
Objectives
  Recommend, extract, and summarize interesting and useful information from web pages according to a user's personal preferences.
  Applications include public opinion investigation, intelligence gathering and monitoring, topic tracking, and employment services.
Paper Overview
  A web news recommendation mechanism
    Recommends interesting news to users
    Obtains news from Google News (http://news.google.com)
    A news filter: provides high-quality news content for analysis
    An embedded learning component
  Web news summarization
    Summarizes useful and interesting information
    Summaries are given in the form of keywords based on lexical chains
    A keyword knowledge base is stored
    A keyword extraction algorithm
Personalized News Filtering & Summarization (PNFS)
  Personalized Web News Filtering
  Web News Summarization
  Keyword Knowledge Base
PNFS System Architecture
  The PNFS system consists of two phases
  Phase 1: Personalized Web News Filtering
  Phase 2: Web News Summarization
PNFS System Architecture
  Fig 1: PNFS system architecture
PNFS System Architecture
  The PNFS system consists of two phases
  Phase 1: Personalized Web News Filtering
    Filter out news stories uninteresting to the user
    Filter out non-news parts of news web pages
  Phase 2: Web News Summarization
System Architecture: Personalized Web News Filtering
  Diagram components: the Internet, web news, news aggregator, news filter, recommended news, learning component, keyword knowledge base, user feedback
PNFS System Architecture
  The PNFS system consists of two phases
  Phase 1: Personalized Web News Filtering
    Filter out news stories uninteresting to the user
    Filter out non-news parts of news web pages
  Phase 2: Web News Summarization
    Gives the user a concise form of the news web page, which saves reading time
    The extracted keywords are also used to build a user interest model
System Architecture: Web News Summarization
  Diagram components: filtered news, word segmentation, extract candidate words, compute word similarity and co-occurrence frequency, construct lexical chains, keyword knowledge base, lexical thesaurus
Personal Web News Recommendation
  A feature selection method to obtain the total word vocabulary
  A recommendation algorithm, based on the k-nearest neighbor algorithm, to track the news events that the user focuses on
  A probability model, using the Naïve Bayes algorithm, to recommend topic-interesting news
Personal Web News Recommendation: Feature Selection
  Feature_Selection(D, k, m, n)
  Input:
    D: document set of all categories
    k: the number of features (keywords) selected for each document
    m: the number of topic categories
    n: the number of features selected for each topic category
  Output:
    F: selected feature set
  for each category Ci (i = 1..m)
    for each document d ∈ Ci
      extract the top k keywords from d
    sort all words in Ci according to the number of times they appear in the top-k keyword lists
    Fi = the n most frequent words in Ci
  return F = F1 ∪ F2 ∪ … ∪ Fm
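
The feature selection procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper `top_k_keywords` is a hypothetical stand-in that simply takes the k most frequent tokens of a document (the slides do not say which extractor is used at this step), and the toy documents are invented.

```python
from collections import Counter

def top_k_keywords(doc_tokens, k):
    """Hypothetical stand-in extractor: the k most frequent tokens of one document."""
    return [word for word, _ in Counter(doc_tokens).most_common(k)]

def feature_selection(docs_by_category, k, n):
    """Return the union over categories of the n words that appear most often
    in the per-document top-k keyword lists of that category."""
    features = set()
    for docs in docs_by_category.values():
        votes = Counter()
        for doc_tokens in docs:
            # Each document contributes one vote per keyword in its top-k list.
            votes.update(top_k_keywords(doc_tokens, k))
        features.update(word for word, _ in votes.most_common(n))
    return features

# Invented toy corpus: two categories, two tokenized documents each.
docs_by_category = {
    "technology": [
        ["apple", "apple", "apple", "iphone", "iphone", "price"],
        ["google", "google", "google", "apple", "apple", "android"],
    ],
    "sports": [
        ["goal", "goal", "goal", "team", "team", "coach"],
        ["goal", "goal", "goal", "coach", "coach", "team"],
    ],
}
selected = feature_selection(docs_by_category, k=2, n=1)
```

With k=2 and n=1, "apple" and "goal" are each the only word that appears in both top-2 lists of their category, so they form the selected feature set.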
Personal Web News Recommendation: Feature Selection
  The most recent 5000 documents per category are collected for feature selection
  The number of keywords extracted from each document is 50
  The number of features selected for each topic category is 1000
Personal Web News Recommendation: Feature Selection
  The top 50 features selected for the technology topic category are as follows:
  apple, company, google, iphone, user, mobile, vehicle, corp, software, app, video, technology, samsung, computer, internet, billion, safety, traffic, facebook, microsoft, credit, smartphone, service, ipad, version, price, web, office, honda, market, competitor, android, federal, gas, search, posted, major, executive, offline, site, incident, system, patent, security, youtube, device, model, industry, network, map
Personal Web News Recommendation: Tracking News
  After feature selection, each news article is represented as a vector using the TFIDF term weighting scheme. In its standard form, the weight of term ti is
    $w_i = tf_i \cdot \log(N / n_i)$
  tf_i: the frequency of term ti in the given web page
  N: the number of documents in the corpus
  n_i: the number of documents in the corpus that contain term ti
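
As a rough illustration of the term weighting, the sketch below computes a TFIDF vector for one tokenized document. It assumes the standard tf · log(N / n_i) form with a natural logarithm and no normalization; the function name `tfidf_vector`, the toy corpus, and the vocabulary are invented for the example.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, vocabulary, doc_freq, num_docs):
    """Weight each vocabulary term as tf * log(N / n_i).
    Terms that occur in no document of the corpus get weight 0.0."""
    tf = Counter(doc_tokens)
    return [
        tf[term] * math.log(num_docs / doc_freq[term]) if doc_freq.get(term) else 0.0
        for term in vocabulary
    ]

# Invented toy corpus of four tokenized documents.
corpus = [
    ["apple", "iphone", "apple"],
    ["google", "android"],
    ["apple", "google"],
    ["honda", "vehicle"],
]
vocabulary = ["apple", "google", "honda"]
# n_i: number of documents that contain each vocabulary term.
doc_freq = {term: sum(term in doc for doc in corpus) for term in vocabulary}
vector = tfidf_vector(corpus[0], vocabulary, doc_freq, len(corpus))
```

Here "apple" occurs twice in the first document and in 2 of the 4 documents, so its weight is 2 · log(4 / 2); the other two terms do not occur in the document and get weight 0.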
Personal Web News Recommendation: Tracking News
  The cosine measure is used to compute the similarity of two vectors d1 and d2:
    $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
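
A small sketch of the cosine measure on plain Python lists. Returning 0.0 for an all-zero vector is a convention chosen for this example, not something stated in the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Vectors that point in the same direction score 1 regardless of length, and vectors that share no terms score 0, which is why the measure suits documents of different sizes.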
Personal Web News Recommendation: Tracking News
  Tracking_News(t1, t2, k, n)
  Input:
    t1, t2: similarity thresholds (0.3, 0.6)
    k: the number of nearest neighbors (20)
    n: the number of tracking news stories to recommend
  Output:
    the top n tracking news stories, and the candidate news for topic-interesting news recommendation
  for each upcoming news story do
    calculate the similarities of the news story with the user's recently read stories and get the k nearest neighbors
    if one of the k similarities is larger than t2
      label the upcoming story as redundant; continue
    if the average of the k similarities is larger than t1
      put the new story into the tracking news queue
    if the average of the k similarities is less than t1
      put the new story into the candidate news queue
  recommend the top n stories in the tracking news queue in descending order of average similarity
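
The tracking procedure might look like the following in Python. This is a sketch under stated assumptions: stories are already TFIDF vectors, the default `n=10` is an arbitrary illustrative value, a story whose average similarity equals t1 exactly is sent to the candidate queue (the pseudocode leaves that case open), and the toy vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def track_news(incoming, recently_read, t1=0.3, t2=0.6, k=20, n=10):
    """Split incoming (story_id, vector) pairs into the top-n tracking stories,
    candidate stories, and redundant stories."""
    tracking, candidates, redundant = [], [], []
    for story_id, vec in incoming:
        # Similarities to the k nearest recently read stories.
        sims = sorted((cosine(vec, r) for r in recently_read), reverse=True)[:k]
        if not sims:
            candidates.append(story_id)  # assumption: no reading history yet
            continue
        if sims[0] > t2:
            redundant.append(story_id)   # too close to something already read
            continue
        average = sum(sims) / len(sims)
        if average > t1:
            tracking.append((average, story_id))
        else:
            candidates.append(story_id)  # assumption: average == t1 lands here
    tracking.sort(reverse=True)          # descending average similarity
    return [story_id for _, story_id in tracking[:n]], candidates, redundant

recently_read = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
incoming = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [1.0, 0.0, 2.0]),
    ("c", [0.0, 0.0, 1.0]),
    ("d", [1.0, 0.0, 1.5]),
]
tracked, candidates, redundant = track_news(incoming, recently_read, k=2, n=2)
```

Story "a" duplicates a read story and is marked redundant, "c" shares nothing with the reading history and becomes a candidate for topic-based recommendation, and "d" and "b" are tracked in that order because "d" has the higher average similarity.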
Personal Web News Recommendation: Interesting Topic News
  K-nearest neighbor performs well in tracking news events and finding novel news
  It fails to reflect the diversity of user interests
  Solution: a learning model based on Naïve Bayes
  Naïve Bayes calculates the probability of a news story being interesting
  Each news story is represented as a feature-value vector
  The Naïve Bayes classifier is built to calculate the topic distribution of user interests
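
To make the Naïve Bayes idea concrete, here is a generic multinomial Naïve Bayes sketch with add-one smoothing that estimates how likely a story is to be interesting. It is not the model from the paper: the class name `NaiveBayesInterest`, the two labels, and the training stories are all invented for illustration.

```python
import math
from collections import Counter

class NaiveBayesInterest:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, stories, labels):
        self.classes = sorted(set(labels))
        self.vocab = {word for tokens in stories for word in tokens}
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(stories, labels):
            self.word_counts[label].update(tokens)
        return self

    def predict_proba(self, tokens):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)  # class prior
            for word in tokens:
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[c][word] + 1
                    score += math.log(count / (total_words + len(self.vocab)))
            log_scores[c] = score
        # Normalize in log space to get posterior probabilities.
        top = max(log_scores.values())
        unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
        total = sum(unnormalized.values())
        return {c: value / total for c, value in unnormalized.items()}

# Invented feedback: stories the user read versus stories the user skipped.
stories = [["apple", "iphone"], ["google", "android"],
           ["flood", "queensland"], ["election", "vote"]]
labels = ["interesting", "interesting", "not_interesting", "not_interesting"]
model = NaiveBayesInterest().fit(stories, labels)
probabilities = model.predict_proba(["apple", "android"])
```

Because both words of the new story were seen only in stories the user read, the posterior for "interesting" comes out at 0.8 on this toy data.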
Personal Web News Recommendation: Interesting Topic News
  The probability that document d is recommended to user u is computed as follows:
Keyword Extraction Based on Lexical Chain (KLC)
  Lexical chain: a sequence of words with related senses
  An interpretation is composed of several disjoint lexical chains
  All the possible interpretations form the interpretation space
  The interpretation with the largest cohesion value represents the correct senses of the words in the text
  The cohesion value is defined as the sum of the similarities between the words in the lexical chains
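
The cohesion-based choice between interpretations can be sketched as below. The example assumes that cohesion sums similarities over every pair of words inside the same chain, which is one reading of the definition; the sense labels (such as "bank#finance") and all similarity values are invented.

```python
from itertools import combinations

def cohesion(interpretation, similarity):
    """Sum of pairwise similarities between words that share a lexical chain."""
    return sum(
        similarity.get(frozenset(pair), 0.0)
        for chain in interpretation
        for pair in combinations(chain, 2)
    )

# Invented similarity scores between word senses.
similarity = {
    frozenset({"bank#finance", "money"}): 0.8,
    frozenset({"bank#finance", "loan"}): 0.7,
    frozenset({"money", "loan"}): 0.6,
    frozenset({"bank#river", "money"}): 0.1,
}

# Two interpretations of the same text; each is a list of disjoint chains.
interpretations = [
    [["bank#finance", "money", "loan"]],
    [["bank#river"], ["money", "loan"]],
]
best = max(interpretations, key=lambda interp: cohesion(interp, similarity))
```

The first interpretation scores 0.8 + 0.7 + 0.6 = 2.1 against 0.6 for the second, so the financial sense of "bank" is selected.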
Keyword Extraction Based on Lexical Chain (KLC)
  Fig 2: An example of resolving word sense ambiguity
Keyword Extraction Based on Lexical Chain (KLC)
  Word co-occurrence model: the Dice coefficient is used to compute the degree of relatedness between words
  Let x and y be two basic events in the probability space, representing the occurrence of words in a document
  The Dice coefficient is defined as
    $Dice(x, y) = \frac{2 P(x, y)}{P(x) + P(y)}$
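
A short sketch of the Dice coefficient estimated from document-level co-occurrence. It assumes the probabilities are estimated as the fraction of documents that contain each word or both words; the function name and the toy documents are invented.

```python
def dice_coefficient(word_x, word_y, documents):
    """Dice(x, y) = 2 * P(x, y) / (P(x) + P(y)), with probabilities estimated
    as fractions of documents (each document is a set of words)."""
    total = len(documents)
    if total == 0:
        return 0.0
    p_x = sum(word_x in doc for doc in documents) / total
    p_y = sum(word_y in doc for doc in documents) / total
    p_xy = sum(word_x in doc and word_y in doc for doc in documents) / total
    if p_x + p_y == 0:
        return 0.0  # neither word occurs anywhere
    return 2 * p_xy / (p_x + p_y)

documents = [
    {"flood", "queensland"},
    {"flood", "australia", "queensland"},
    {"flood", "home"},
    {"election", "vote"},
]
related = dice_coefficient("flood", "queensland", documents)
unrelated = dice_coefficient("flood", "vote", documents)
```

On this toy data "flood" appears in 3 of 4 documents, "queensland" in 2, and both together in 2, giving 2 · 0.5 / 1.25 = 0.8, while "flood" and "vote" never co-occur and score 0.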
Keyword Extraction Based on Lexical Chain (KLC)
  Keyword extraction algorithm
  1. Words are segmented and stemmed, and stop words are removed
  2. Select the top n words by TFIDF as candidate words
  3. Build the disambiguation graph in which each node is a candidate word
  4. Perform the word sense disambiguation for each candidate word
  5. Build the actual lexical chains
  6. Compute the weight of each candidate word wi as follows:
  7. Select the top m words as the keywords extracted from the candidate words by their weights
Keyword Extraction Based on Lexical Chain (KLC)
  n is set to 30 based on empirical studies
  n should be between 20 and 50
  If n is smaller than 20, the advantages of semantic relations would not be evident
  If n is greater than 50, the importance of word frequency to the extracted keywords would be reduced
  t3 is set to 0.3
Keyword Extraction Evaluation
Experiments & Results
  Experiment 1: Impact of the Number of Keywords Selected
  Fig 3: The precision of KLC with three different feature sets
Experiments & Results
  Experiment 1: Impact of the Number of Keywords Selected
  Fig 4: The recall of KLC with three different feature sets
Experiments & Results
  Experiment 2: Nouns and Verbs
  Fig 5: The precision of KLC with two different candidate sets
Experiments & Results
  Experiment 2: Nouns and Verbs
  Fig 6: The recall of KLC with two different candidate sets
The Interface of the PNFS System
An Original Web News Example
Extracted keywords: flood, home, year, record, end, Australia, state, overall, Queensland, worst
Lexical chains:
  flood, home, end, year, overall, worst
  Australia, Queensland, state
  record
Related Work
  Recommender System
  Web News Extraction
  Keyword Extraction
Related Work: Recommender System
  Recommender systems:
    Content-Based Recommendation
    Collaborative Filtering
    Hybrid Recommendation
  Fig 7: Different techniques used in recommendation systems
Related Work: Web News Extraction
  Web information extraction has three targets:
    Records in a web page
    Specific interesting attributes
    Main content of the page
  Most web information exploration systems use extraction rules
  Extraction rules are represented by regular grammars, first-order logic, or a tag tree with features
  Features include HTML tags, literal words, DOM tree paths, part-of-speech tags, token lengths, etc.
  W4F: uses a DOM tree path to address a web page
  Chakrabarti: uses URL tokens, HTML titles, and keywords
  NFAS: filtering and summarizing
Related Work: Keyword Extraction
  Supervised extraction: GenEx, KEA system
  Unsupervised extraction: graph-based ranking methods, candidate phrase generation methods, lexical chain based methods
Summary
  Designed a content-based news recommender that uses two learning strategies to model the user's interest preferences
  A news filter is used to filter out advertisements and other irrelevant parts of the news web page
  Presented a new keyword extraction method based on semantic relations
  Studied the semantic relations between words based on a lexical thesaurus and word co-occurrence
  Constructed lexical chains to link the relations and extract keywords
Exam Questions
  Q1: What are the applications of PNFS?
    Public opinion investigation
    Intelligence gathering and monitoring
    Topic tracking
    Employment services
  Q2: What are the components of the personalized filtering subsystem?
    A news aggregator
    A news filter
    A learning component
    A keyword knowledge base
Exam Questions
  Q3: How is a news article represented using the TFIDF term weighting scheme?
  Answer: each article is a vector of term weights; in the standard form, $w_i = tf_i \cdot \log(N / n_i)$, where
    tf_i: the frequency of term ti in the given web page
    N: the number of documents in the corpus
    n_i: the number of documents in the corpus that contain term ti
Thank You!! Questions?