PNFS: PERSONALIZED NEWS FILTERING & SUMMARIZATION ON THE WEB

Authors: Xindong Wu, Fei Xie, Gongqing Wu, Wei Ding
23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2011
Presenter: Rashida Hasan
April 06, 2017
Center for Advanced Computer Studies, University of Louisiana at Lafayette

Overview
- Motivation
- Objective
- Paper Overview
- Personalized News Filtering & Summarization (PNFS)
- Experiments & Results
- Related Work
- Summary
- Exam Questions

Motivation
The indexed Web contains at least 5.02 billion pages (as of Tuesday, 08 September 2015).

Objectives
- Recommendation, extraction, and summarization of interesting and useful information from web pages according to a user's personal preferences
- Applications include public opinion investigation, intelligence gathering and monitoring, topic tracking, and employment services

Paper Overview
- A web news recommendation mechanism: recommends interesting news to users
  - A news aggregator: obtains news from Google News (http://news.google.com)
  - A news filter: provides high-quality news content for analysis
  - An embedded learning component
- Web news summarization: summarizes useful and interesting information
  - Summaries are given in the form of keywords based on lexical chains
  - A keyword knowledge base is stored
  - A keyword extraction algorithm

Personalized News Filtering & Summarization (PNFS)
- Personalized Web News Filtering
- Web News Summarization
- Keyword Knowledge Base

PNFS System Architecture
The PNFS system consists of two phases:
- Phase 1: Personalized Web News Filtering
- Phase 2: Web News Summarization

PNFS System Architecture
Fig 1: PNFS System Architecture

PNFS System Architecture
The PNFS system consists of two phases:
- Phase 1: Personalized Web News Filtering
  - Filter out news stories uninteresting to the user
  - Filter out non-news parts of news web pages
- Phase 2: Web News Summarization

System Architecture: Personalized Web News Filtering
[Diagram components: The Internet, Web News, News Aggregator, News Filter, Recommended News, Learning Component, Keyword Knowledge Base, User Feedback]

PNFS System Architecture
The PNFS system consists of two phases:
- Phase 1: Personalized Web News Filtering
  - Filter out news stories uninteresting to the user
  - Filter out non-news parts of news web pages
- Phase 2: Web News Summarization
  - Gives the user a concise form of the news web page, saving reading time
  - The extracted keywords are also used to build a user interest model

System Architecture: Web News Summarization
[Diagram components: Keyword Knowledge Base, Lexical Thesaurus, Word Segmentation, Filtered News, Compute Word Similarity and Co-Occurrence Frequency, Extract Candidate Words, Construct Lexical Chains]

Personal Web News Recommendation
- A feature selection method to obtain the total word vocabulary
- A recommendation algorithm, based on the K-nearest neighbor algorithm, to track the news events that the user would focus on
- A probability model, using the Naïve Bayes algorithm, to recommend news on topics that interest the user

Personal Web News Recommendation: Feature Selection
Feature_Selection(D, k, m, n)
Input:
  D: document set of all categories
  k: the number of features selected for each document
  m: the number of topic categories
  n: the number of features selected for each topic category
Output:
  F: the selected feature set
for each category Ci (i = 1..m)
  for each document d ∈ Ci
    extract the top k keywords from d
  sort all words in Ci by the number of times they appear in the top-k keyword lists
  Fi = the n most frequent words in Ci
return F = F1 ∪ F2 ∪ … ∪ Fm
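The feature selection procedure above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `feature_selection` and `top_k_keywords` are hypothetical, documents are assumed to be pre-tokenized lists of words, and the per-document keyword extractor is a stand-in that ranks words by raw frequency (the slides use their own keyword extraction method for this step). The category count m is implicit in the size of the input mapping.

```python
from collections import Counter


def top_k_keywords(doc_tokens, k):
    """Stand-in keyword extractor (assumption): the k most frequent words."""
    return [word for word, _ in Counter(doc_tokens).most_common(k)]


def feature_selection(docs_by_category, k, n):
    """Return the union of the n most frequent top-k keywords per category.

    docs_by_category maps a category name to a list of tokenized documents.
    """
    selected = set()
    for docs in docs_by_category.values():
        appearances = Counter()
        for doc in docs:
            # Count each word once per document whose top-k list contains it.
            appearances.update(set(top_k_keywords(doc, k)))
        # F_i: the n words that appear most often in the top-k lists.
        selected.update(word for word, _ in appearances.most_common(n))
    return selected
```

With the settings reported in the slides, this would be called with k = 50 and n = 1000 on the 5000 most recent documents per category.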

Personal Web News Recommendation: Feature Selection
- The most recent 5000 documents per category are collected for feature selection
- The number of keywords extracted for each document is 50
- The number of features selected for each topic category is 1000

Personal Web News Recommendation: Feature Selection
The top 50 features selected for the technology topic category are as follows:
apple, company, google, iphone, user, mobile, vehicle, corp, software, app, video, technology, samsung, computer, internet, billion, safety, traffic, facebook, microsoft, credit, smartphone, service, ipad, version, price, web, office, honda, market, competitor, android, federal, gas, search, posted, major, executive, offline, site, incident, system, patent, security, youtube, device, model, industry, network, map

Personal Web News Recommendation: Tracking News
After feature selection, each news article is represented as a vector using the TFIDF term weighting scheme, where:
- tf: the frequency of term ti in the given web page
- N: the number of documents in the corpus
- ni: the number of documents in the corpus that contain term ti

Personal Web News Recommendation: Tracking News
The cosine measure is used to compute the similarity of two vectors.
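As a rough illustration of the vector representation and the cosine measure, the sketch below assumes the common weighting tf × log(N / ni), which is consistent with the three quantities tf, N, and ni defined for the TFIDF scheme but may differ from the exact variant used in the paper. The function names and argument layout are hypothetical.

```python
import math


def tfidf_vector(term_counts, doc_freq, num_docs, vocabulary):
    """Build a TFIDF vector over a fixed vocabulary.

    term_counts: term -> frequency in this page (tf)
    doc_freq:    term -> number of corpus documents containing it (ni)
    num_docs:    number of documents in the corpus (N)
    Assumed weighting: tf * log(N / ni).
    """
    vector = []
    for term in vocabulary:
        tf = term_counts.get(term, 0)
        n_i = doc_freq.get(term, 0)
        weight = tf * math.log(num_docs / n_i) if tf and n_i else 0.0
        vector.append(weight)
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the vocabulary is fixed by feature selection, every article maps to a vector of the same length, so any two articles can be compared directly.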

Personal Web News Recommendation: Tracking News
Tracking_News(t1, t2, k, n)
Input:
  t1, t2: similarity thresholds (0.3, 0.6)
  k: the number of nearest neighbors (20)
  n: the number of tracking news stories to recommend
Output:
  the top n tracking news stories and the candidate news for topic-interesting news recommendation
for each upcoming news story do
  calculate the similarities of the news story with the user's recently read stories and get the k nearest neighbors
  if one of the k similarities is larger than t2
    label the upcoming story as redundant; continue
  if the average of the k similarities is larger than t1
    put the new story into the tracking news queue
  if the average of the k similarities is less than t1
    put the new story into the candidate news queue
recommend the top n stories in the tracking news queue in descending order of average similarity
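A minimal Python sketch of the tracking procedure follows. It is an illustration under stated assumptions, not the authors' code: `track_news` and its parameters are hypothetical names, `similarity` is any caller-supplied function (for example a cosine measure over TFIDF vectors), and the default n = 10 is an arbitrary placeholder because the slides give no value for n. The slides do not say what happens when the average equals t1 exactly or when the user has no reading history; the sketch treats both cases as candidate news.

```python
def track_news(incoming, recent, similarity, t1=0.3, t2=0.6, k=20, n=10):
    """Split incoming stories into tracking news and candidate news.

    Returns (top n tracking stories by average similarity, candidate stories).
    """
    tracking, candidates = [], []
    for story in incoming:
        # Similarities to the k nearest recently read stories.
        sims = sorted((similarity(story, r) for r in recent), reverse=True)[:k]
        if not sims:
            candidates.append(story)  # assumption: no history means candidate
            continue
        if max(sims) > t2:
            continue  # too close to something already read: redundant
        average = sum(sims) / len(sims)
        if average > t1:
            tracking.append((average, story))
        else:
            candidates.append(story)  # includes the unspecified average == t1 case
    tracking.sort(key=lambda pair: pair[0], reverse=True)
    return [story for _, story in tracking[:n]], candidates
```

The two thresholds act as a band: stories above t2 are near-duplicates of what the user has read, stories whose average similarity is above t1 continue an event the user follows, and the rest are passed on for topic-level recommendation.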

Personal Web News Recommendation: Interesting Topic News
- K-nearest neighbor performs well in tracking news events and finding novel news, but fails to reflect the diversity of user interests
- Solution: a learning model based on Naïve Bayes
- Naïve Bayes calculates the probability of a news story being interesting
- Each news story is represented as a feature-value vector
- The Naïve Bayes classifier is built to calculate the topic distribution of user interests

Personal Web News Recommendation: Interesting Topic News
The model computes the probability that document d is recommended to user u.
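To make the idea of a Naïve Bayes interest model concrete, here is a generic multinomial Naïve Bayes sketch with add-one smoothing. It is not the paper's probability model, which may be defined differently; the function names, the label strings, and the use of token lists as features are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs, e.g. label 'interesting'."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def posterior(tokens, model):
    """Return label -> P(label | tokens) under the naive independence assumption."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        log_scores[label] = score
    # Normalise in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}
```

In a recommender, the labels would come from user feedback (stories read versus skipped), and candidate stories would be ranked by the posterior of the "interesting" class.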

Keyword Extraction Based on Lexical Chain (KLC)
Lexical chain:
- A lexical chain is a sequence of words with related senses
- An interpretation is composed of several disjoint lexical chains
- All the possible interpretations form the interpretation space
- The interpretation with the largest cohesion value represents the correct sense of the words in the text
- The cohesion value is defined as the sum of the similarities between the words in the lexical chains

Keyword Extraction Based on Lexical Chain (KLC)
Fig 2: An example of resolving word sense ambiguity

Keyword Extraction Based on Lexical Chain (KLC)
Word co-occurrence model:
- The Dice coefficient is used to compute the degree of relatedness between words
- Let x and y be two basic events in the probability space, representing the occurrence of words in a document
- The Dice coefficient, in its standard form, is defined as $\mathrm{Dice}(x, y) = \frac{2\,P(x \wedge y)}{P(x) + P(y)}$
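The standard Dice coefficient, 2·P(x and y) / (P(x) + P(y)), can be estimated from document counts, since the corpus size cancels out of the ratio. The sketch below assumes that estimate; the function name and the representation of documents as token lists are hypothetical.

```python
def dice_coefficient(word_x, word_y, documents):
    """Estimate Dice(x, y) from document-level co-occurrence counts.

    documents: iterable of token lists. A word "occurs" in a document
    if it appears there at least once.
    """
    doc_sets = [set(doc) for doc in documents]
    count_x = sum(1 for d in doc_sets if word_x in d)
    count_y = sum(1 for d in doc_sets if word_y in d)
    count_xy = sum(1 for d in doc_sets if word_x in d and word_y in d)
    if count_x + count_y == 0:
        return 0.0  # neither word occurs anywhere
    return 2 * count_xy / (count_x + count_y)
```

The value ranges from 0 (the words never share a document) to 1 (they always appear together), which makes it usable as an edge weight between candidate words that a thesaurus does not relate.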

Keyword Extraction Based on Lexical Chain (KLC)
Keyword extraction algorithm:
1. Words are segmented and stemmed, and stop words are removed
2. Select the top n words by TFIDF as candidate words
3. Build the disambiguation graph, in which each node is a candidate word
4. Perform word sense disambiguation for each candidate word
5. Build the actual lexical chains
6. Compute the weight of each candidate word wi
7. Select the top m candidate words by weight as the extracted keywords

Keyword Extraction Based on Lexical Chain (KLC)
- n is set to 30 based on empirical studies
- n should be between 20 and 50
- If n is smaller than 20, the advantages of semantic relations are not evident
- If n is greater than 50, the importance of word frequency to the extracted keywords is reduced
- t3 is set to 0.3

Keyword Extraction Evaluation
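The experiments report the precision and recall of KLC. Under the usual definitions, precision is the fraction of extracted keywords that are correct and recall is the fraction of reference keywords that were extracted. The sketch below assumes those definitions and a hypothetical set of reference keywords per article; the paper's exact evaluation protocol may differ.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of extracted keywords against a reference set."""
    extracted_set, reference_set = set(extracted), set(reference)
    correct = len(extracted_set & reference_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    return precision, recall
```

Extracting more keywords per article typically raises recall and lowers precision, which is why the two measures are reported separately.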

Experiments & Results
Experiment 1: Impact of the Number of Keywords Selected
Fig 3: The precision of KLC with three different feature sets

Experiments & Results
Experiment 1: Impact of the Number of Keywords Selected
Fig 4: The recall of KLC with three different feature sets

Experiments & Results
Experiment 2: Nouns and Verbs
Fig 5: The precision of KLC with two different candidate sets

Experiments & Results
Experiment 2: Nouns and Verbs
Fig 6: The recall of KLC with two different candidate sets

The Interface of the PNFS System

An Original Web News Example

Extracted keywords: flood, home, year, record, end, Australia, state, overall, Queensland, worst
Lexical chains:
- flood, home, end, year, overall, worst
- Australia, Queensland, state
- record

Related Work
- Recommender Systems
- Web News Extraction
- Keyword Extraction

Related Work: Recommender Systems
Fig 7: Different techniques used in recommendation systems (content-based recommendation, collaborative filtering, hybrid recommendation)

Related Work: Web News Extraction
- Web information extraction has three targets:
  - Records in a web page
  - Specific attributes of interest
  - The main content of the page
- Most web information extraction systems use extraction rules
- Extraction rules are represented by regular grammars, first-order logic, or a tag tree with features
- Features include HTML tags, literal words, DOM tree paths, part-of-speech tags, token lengths, etc.
- W4F: uses a DOM tree path to address a web page
- Chakrabarti: uses URL tokens, HTML titles, and keywords
- NFAS: filtering and summarizing

Related Work: Keyword Extraction
- Supervised extraction: GenEx, the Kea system
- Unsupervised extraction: graph-based ranking methods, candidate phrase generation methods, lexical chain based methods

Summary
- Designed a content-based news recommender that uses two learning strategies to model user interest preferences
- A news filter is used to filter out advertisements and other irrelevant parts of the news web page
- Presented a new keyword extraction method based on semantic relations
- Studied the semantic relations between words based on a lexical thesaurus and word co-occurrence
- Constructed lexical chains to link the relations and extract keywords

Exam Questions
Q1: What are the applications of PNFS?
- Public opinion investigation
- Intelligence gathering and monitoring
- Topic tracking
- Employment services
Q2: What are the components of the personalized filtering subsystem?
- A news aggregator
- A news filter
- A learning component
- A keyword knowledge base

Exam Questions
Q3: How is a news article represented using the TFIDF term weighting scheme?
Answer: Each article is represented as a vector of term weights computed from:
- tf: the frequency of term ti in the given web page
- N: the number of documents in the corpus
- ni: the number of documents in the corpus that contain term ti

Thank You! Questions?