A Personalized Search Engine Based on Web Snippet Hierarchical Clustering Paolo Ferragina, Antonio Gulli Presented by Bin Tan.

Clustering Web Search Results
Challenges:
- Clustering works on short snippets instead of whole documents
- Clustering must be done on the fly
- Clusters should be labeled with meaningful text (accurate and intelligible)
- Clusters need to be distinctive
Examples: Vivisimo, SnakeT

Categorization of Works
- Flat clustering vs. hierarchical clustering
- Label representation: bag of words vs. contiguous phrase vs. non-contiguous phrase ("gapped sentence")

Preprocessing
- Fetch snippets from 16 search engines
- Enrich snippets with anchor texts from a crawled database of 200M web pages

Identification of Candidate Phrases for Labels
- Enumerate all pairs of words that occur within a proximity window (of size 4) in the snippets
- Score the pairs based on:
  - NLP features: part of speech (PoS), named entity (NE)
  - ODP (Open Directory Project) occurrences: term frequency (col freq * inv cat freq?), containing category
- Discard low-score pairs
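
The pair enumeration step can be illustrated with a minimal Python sketch. It is not the SnakeT implementation: the function name candidate_pairs is hypothetical, the window is assumed to mean "both words fall within 4 consecutive tokens", and the NLP/ODP scoring is replaced by a plain count of the snippets that contain the pair.

```python
from collections import Counter

def candidate_pairs(snippets, window=4, min_score=2):
    """Return ordered word pairs that co-occur within `window` tokens."""
    scores = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        seen = set()  # count each pair at most once per snippet
        for i, left in enumerate(tokens):
            # Pair `left` with the words that follow it inside the window.
            for right in tokens[i + 1 : i + window]:
                seen.add((left, right))
        scores.update(seen)
    # Stand-in for the NLP/ODP scoring (an assumption of this sketch):
    # keep only the pairs found in at least `min_score` snippets.
    return {pair: s for pair, s in scores.items() if s >= min_score}

snippets = [
    "jaguar car dealer in london",
    "used jaguar car for sale",
    "jaguar the big cat of south america",
]
pairs = candidate_pairs(snippets)
print(pairs)
```

In this toy input only the pair ("jaguar", "car") appears in two snippets, so it is the only one that survives the threshold.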

Identification of Candidate Phrases for Labels (cont.)
- Word pairs are atomic phrases (how about single words?)
- Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
- Score each phrase based on its constituents' scores
- Discard low-score phrases
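
A rough sketch of the incremental merge follows. The slide does not give the merge rule or the scoring formula, so both are assumptions here: a phrase is extended to the right by any pair that starts with its last word, and the new score is the average of the two constituents' scores. The name merge_phrases and the parameters max_len and min_score are hypothetical.

```python
def merge_phrases(pair_scores, max_len=4, min_score=3.0):
    """Grow ordered phrases by chaining pairs that overlap on one word."""
    phrases = dict(pair_scores)   # phrase (tuple of words) -> score
    frontier = dict(pair_scores)  # phrases created in the previous round
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:
                continue  # size limit reached
            for (first, second), pair_score in pair_scores.items():
                # Extend only to the right, so word order is preserved.
                if first == phrase[-1] and second not in phrase:
                    longer = phrase + (second,)
                    # Assumed scoring: average of the constituents' scores.
                    new_score = (score + pair_score) / 2
                    if new_score >= min_score and longer not in phrases:
                        grown[longer] = new_score
        phrases.update(grown)
        frontier = grown
    return phrases

pair_scores = {
    ("web", "search"): 4.0,
    ("search", "engine"): 3.0,
    ("engine", "room"): 1.0,
}
phrases = merge_phrases(pair_scores)
print(phrases)
```

The result contains the input pairs plus the merged phrase ("web", "search", "engine"); the candidates ending in "room" fall below the threshold and are discarded.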

Hierarchical Clustering
- Group all snippets containing a candidate phrase into an atomic cluster (overlapping allowed)
- Primary label: the aforementioned candidate phrase
- Secondary labels: other candidate phrases occurring in 80% of the snippets in the cluster
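
The construction of atomic clusters can be sketched as below. This is an illustration under assumptions, not the authors' code: the helper names contains and atomic_clusters are hypothetical, a phrase is taken to match a snippet when its words appear in order with gaps allowed, and "80%" is read as "at least 80%".

```python
def contains(snippet_tokens, phrase):
    """True if the words of `phrase` appear in order (gaps allowed)."""
    remaining = iter(snippet_tokens)
    # `word in remaining` consumes the iterator up to the match,
    # so later words must occur after earlier ones.
    return all(word in remaining for word in phrase)

def atomic_clusters(snippets, phrases, threshold=0.8):
    """Map each candidate phrase to its snippets and secondary labels."""
    tokenized = [s.lower().split() for s in snippets]
    clusters = {}
    for primary in phrases:
        members = [i for i, toks in enumerate(tokenized) if contains(toks, primary)]
        if not members:
            continue
        secondary = [
            other
            for other in phrases
            if other != primary
            and sum(contains(tokenized[i], other) for i in members)
            >= threshold * len(members)
        ]
        clusters[primary] = {"snippets": members, "secondary": secondary}
    return clusters

snippets = [
    "jaguar car dealer in london",
    "used jaguar car for sale",
    "jaguar car dealer reviews",
    "jaguar the big cat",
]
phrases = [("jaguar", "car"), ("car", "dealer"), ("big", "cat")]
clusters = atomic_clusters(snippets, phrases)
print(clusters)
```

Snippets 0 and 2 belong to both the "jaguar car" and the "car dealer" cluster, which shows the overlap the slide allows. "jaguar car" becomes a secondary label of "car dealer" because it occurs in every snippet of that cluster, while the reverse does not hold (2 of 3 snippets is below 80%).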

Hierarchical Clustering (cont.)
- Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
- Primary label: the shared label
- Secondary labels: other labels occurring in 80% of the snippets in the cluster
- Prune second-level clusters that have similar coverage or similar labels
- Recursively produce third-level clusters
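
One way to picture the merge and prune steps is the sketch below. The slide does not say how "similar coverage" is measured, so the Jaccard overlap of snippet sets and the cutoff max_overlap are assumptions, as are the function names and the input layout (each atomic cluster lists its snippet ids and all of its labels, primary and secondary). Secondary labels of the parents and the recursion to a third level are left out.

```python
from collections import defaultdict

def jaccard(a, b):
    """Overlap of two snippet sets, between 0 and 1."""
    return len(a & b) / len(a | b)

def second_level_clusters(atomic, max_overlap=0.7):
    # Index atomic clusters by every label they carry.
    by_label = defaultdict(list)
    for name, cluster in atomic.items():
        for label in cluster["labels"]:
            by_label[label].append(name)

    # A label shared by two or more atomic clusters yields a candidate parent.
    candidates = []
    for label, children in by_label.items():
        if len(children) < 2:
            continue
        covered = set().union(*(atomic[c]["snippets"] for c in children))
        candidates.append(
            {"label": label, "children": sorted(children), "snippets": covered}
        )

    # Prune: visit larger candidates first and drop any candidate whose
    # coverage is too close to one that was already kept.
    candidates.sort(key=lambda c: (-len(c["snippets"]), c["label"]))
    kept = []
    for cand in candidates:
        if all(jaccard(cand["snippets"], k["snippets"]) < max_overlap for k in kept):
            kept.append(cand)
    return kept

atomic = {
    "jaguar car": {"snippets": {0, 1, 2}, "labels": {"jaguar car", "luxury car"}},
    "car dealer": {"snippets": {0, 2}, "labels": {"car dealer", "jaguar car", "luxury car"}},
    "car sale": {"snippets": {1, 4}, "labels": {"car sale", "jaguar car"}},
    "big cat": {"snippets": {3}, "labels": {"big cat"}},
}
parents = second_level_clusters(atomic)
print(parents)
```

Here "jaguar car" and "luxury car" are both shared labels, but their coverage overlaps by 0.75, so only the larger "jaguar car" parent is kept.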

How SnakeT Can Be Used
- Hierarchical browsing for knowledge extraction
- Hierarchical browsing for result selection
- Query reformulation
- Personalized ranking(?)

Evaluation

Evaluation (cont.)

Clustering technology: PageRank of the future?
Pros:
- Ambiguous query: narrow down the result list
- Less ambiguous query: get a bird's-eye view of different aspects
Cons:
- Clustering is slow and often unnecessary
- It takes time to look at the clusters
- Cluster and label quality still leaves much to be desired