SnakeT A personalized search engine Paolo Ferragina Dipartimento di Informatica, Università di Pisa (Joint with Antonio Gullì) To be presented at WWW 2005.

Slides:



Advertisements
Similar presentations
Query Chain Focused Summarization Tal Baumel, Rafi Cohen, Michael Elhadad Jan 2014.
Advertisements

Chapter 5: Introduction to Information Retrieval
Google Chrome & Search C Chapter 18. Objectives 1.Use Google Chrome to navigate the Word Wide Web. 2.Manage bookmarks for web pages. 3.Perform basic keyword.
Natural Language Processing WEB SEARCH ENGINES August, 2002.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
Web search results clustering Web search results clustering is a version of document clustering, but… Billions of pages Constantly changing Data mainly.
Information Retrieval in Practice
Search Engines and Information Retrieval
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Mastering the Internet, XHTML, and JavaScript Chapter 7 Searching the Internet.
By Andrei Broder, IBM Research 1 A Taxonomy of Web Search Presented By o Onur Özbek o Mirun Akyüz.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
Searching the Web II. The Web Why is it important: –“Free” ubiquitous information resource –Broad coverage of topics and perspectives –Becoming dominant.
(c) Maria Indrawan Distributed Information Retrieval.
Information Retrieval in Practice
Web Algorithmics Web Search Engines. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, , blog, e-book,... Query.
Web Algorithmics Web Search Engines. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, , blog, e-book,... Query.
Exercise 1: Bayes Theorem (a). Exercise 1: Bayes Theorem (b) P (b 1 | c plain ) = P (c plain ) P (c plain | b 1 ) * P (b 1 )
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Help People Find What They Don ’ t Know Hao Ma CSE, CUHK.
Algorithms for Information Retrieval Prologue. References Managing gigabytes A. Moffat, T. Bell e I. Witten, Kaufmann Publisher A bunch of scientific.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Enterprise & Intranet Search How Enterprise is different from Web search What to think about when evaluating Enterprise Search How Intranet use is different.
Search Engines and Information Retrieval Chapter 1.
Avalanche Internet Data Management System. Presentation plan 1. The problem to be solved 2. Description of the software needed 3. The solution 4. Avalanche.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Dr. Susan Gauch When is a rock not a rock? Conceptual Approaches to Personalized Search and Recommendations Nov. 8, 2011 TResNet.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
 Search Engine Search Engine  Steps to Search for webpages pertaining to a specific information Steps to Search for webpages pertaining to a specific.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
Similar Document Search and Recommendation Vidhya Govindaraju, Krishnan Ramanathan HP Labs, Bangalore, India JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE.
Improving Suffix Tree Clustering Base cluster ranking s(B) = |B| * f(|P|) |B| is the number of documents in base cluster B |P| is the number of words in.
Intent Subtopic Mining for Web Search Diversification Aymeric Damien, Min Zhang, Yiqun Liu, Shaoping Ma State Key Laboratory of Intelligent Technology.
29-30 October, 2006, Estonia 1 IST4Balt Information analysis using social bookmarking and other tools IST4Balt Information analysis using social bookmarking.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Chapter 6: Information Retrieval and Web Search
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Search Engines: The players and the field The mechanics of a typical search. The search engine wars. Statistics from search engine logs. The architecture.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
A Personalized Search Engine Based on Web Snippet Hierarchical Clustering Paolo Ferragina, Antonio Gulli Dipartimento di Informatica, Pisa
1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science TsingHua University Beijing , China Dou Shen,
CS315-Web Search & Data Mining. A Semester in 50 minutes or less The Web History Key technologies and developments Its future Information Retrieval (IR)
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Information Retrieval
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
A Personalized Search Engine Based on Web Snippet Hierarchical Clustering Paolo Ferragina, Antonio Gulli Presented by Bin Tan.
Recommendation systems Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.
Third Edition Discovering the Internet Discovering the Internet Complete Concepts and Techniques, Second Edition Chapter 3 Searching the Web.
WEB SEARCH BASICS By K.KARTHIKEYAN. Web search basics The Web Ad indexes Web spider Indexer Indexes Search User Sec
Information Retrieval in Practice
Information Retrieval in Practice
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Search Engine Architecture
Information Retrieval (in Practice)
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Data Mining Chapter 6 Search Engines
Introduction to Information Retrieval
Panagiotis G. Ipeirotis Luis Gravano
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Presentation transcript:

SnakeT A personalized search engine Paolo Ferragina Dipartimento di Informatica, Università di Pisa (Joint with Antonio Gullì) To be presented at WWW 2005

Retrieve the docs that are “relevant” for the user query  Doc: file word or pdf, web page, , blog, book,...  Query: paradigm “bag of words”  Relevant ? Subjective and time-varying concept Users are lazy Selective queries are difficult to be composed Web pages are heterogeneous, numerous and changing frequently  Web search is a difficult, cyclic process Goal of a Search Engine Query Results Analysis

Evolution of Search Engines  First generation -- use only on-page, text data Word frequency and language  Second generation -- use off-page, web-specific data Link (or connectivity) analysis Anchor-text (How people refer to this page)  Third generation -- answer “the need behind the query” Focus on “user need”, rather than on query Click-through data Context determination Helping the user AV, Excite, Lycos, etc From Made popular by Google, now everyone Still experimental

User Needs Informational – want to learn about something (~40%) Navigational – want to go to that page (~25%) Transactional – want to do something (~35%)  Access a service  Downloads  Shop Gray areas  Find a good hub  Exploratory search “see what’s there” Asthma Alitalia Haifa weather Mars surface images Nikon CoolPix Car rental in Finland

Queries  ill-defined queries Short  2001: 2.54 terms avg  80% less than 3 terms Imprecise terms 78% are not modified  Wide variance in Needs Expectations Knowledge Patience: 85% look at 1 page

Web is highly dynamic [weekly snapshot during 2004: 154 web sites, 3  5 mil pg, 65Gb] A “good” coverage of the Web is difficult/impossible Normalized wrt first week

Different Coverage Google vs Yahoo Share 3.8 results in the top 10 on avg Share 23% in the top 100 on avg

The current scenario [Gullì-Signorini, www 05] Legenda G = Google M = MSN T = Teoma Y = Yahoo Indexable Web is about 11.5 billions of web pages Google is about 67% of any other search engine, Y 59%, M 49%, T 43% Intersection of these indexes is about 2.7 billions of pages

In summary Current search engines incur in many difficulties:  Link-based ranking may be inadequate: bags of words paradigm, ambiguous queries, polarized queries, …  Coverage of one search engine is poor, meta-search engines cover more but “difficult” to fuse multiple sources  User needs are subjective and time-varying  Users are lazy and look to few results A new “complementary” approach is Web-Snippet Hierarchical Clustering

The user has no longer to browse through tedious pages of results, but (s)he may navigate a hierarchy of folders whose labels give a glimpse of all themes present in the returned results. An interesting approach Web-snippet

 The folder hierarchy must be formed  “on-the-fly from the snippets”: because it must adapt to the themes of the results without any costly remote access to the original web pages or documents  “and his folders may overlap”: because a snippet may deal with multiple themes Canonical clustering is instead persistent and generated only once  The folder labels must be formed  “on-the-fly from the snippets” because labels must capture the potentially unbounded themes of the results without any costly remote access to the original web pages or documents.  “and be intelligible sentences” because they must facilitate the user post-navigation It seems a “document organization into topical context”, but snippets are poorly composed, no “structural information” is available for them, and “static classification” into predefined categories would be not appropriate. Web-Snippet Hierarchical Clustering

 We may identify four main approaches (ie. taxonomy)  Single words and Flat clustering – Scatter/Gather, WebCat, Retriever  Sentences and Flat clustering – Grouper, Carrot2, Lingo, Microsoft China  Single words and Hierarchical clustering – FIHC, Credo  Sentences and Hierarchical clustering – Lexical Affinities clustering, Hierarchical Grouper, SHOC, CIIRarchies, Highlight, IBM India  Conversely, we have many commercial proposals:  Northerlight (stopped 2002)  Copernic, Mooter, Kartoo, Groxis, Clusty, Dogpile, iBoogie,…  Vivisimo is surely the best ! The Literature Legenda  Software is avalilable  Latest result, but no software

To be presented at WWW 2005

SnakeT ’s main features 2 knowledge bases for ranking/choosing the labels DMOZ is used as a feature selection and sentence ranker index Text anchors are used for snippets enrichment Labels are gapped sentences of variable length Grouper’s extension, to match sentences which are “almost the same” Lexical Affinities clustering extension to k-long LAs Hierarchy formation deploys the folder labels and coverage “Primary” and “secondary” labels for finer/coarser clustering Syntactic and covering pruning rules for simplification and compaction 18 engines (Web, news and books) are queried on-the-fly Google, Yahoo, Teoma, A9 Amazon, Google-news, etc.. They are used as black-boxes

Generation of the Candidate Labels … build frequent gapped sentences (this extends other approaches)  Extract all word pairs occurring in the snippets within some proximity window  Rank them by exploiting: KB + frequency within snippets  Discard the pairs whose rank is below a threshold  Merge repeatedly the remaining pairs by taking into account their original position, their order, and the sentence boundary within the snippets … We have so generated the candidate labels …

Hierarchical Clustering (base case) Key feature is readability - Labels should be fairly distinct among folder descendants and siblings. Initially build the folder leaves:  Build one folder C i from each candidate label L i. The folder includes the snippets that contain the gapped sentence L i L i is called primary label of the folder C i  Enrich C i with secondary labels S i which are gapped sentences with good rank and occurring in at least the 80% of C i ’s snippets The signature of the folder C i is the string L i $S 1 $ … $S k … We have so generated the candidate leaves and their labels…

Hierarchical Clustering (induction)  Parent formation: build the next level of the hierarchy  Find substrings shared by the folder signatures  If  occurs in the signatures of C 1 … C z then we create the father of these folders having primary label , and secondary labels derived from the ones of C 1 … C z provided that the “80% threshold” is satisfied  Ranking: Clean-up some folders based on ranking  Compute the rank of the primary labels of these new folders  Discard the folders having low rank  Pruning: based on hierarchy “readability” and “compactness”  Context pruning: greedy approach to a graph covering problem (NPH)  Content pruning: empirical entropy about word distribution plus … … We have so generated the hierarchy of labeled folders…

Our personalization aims for... SnakeT

SnakeT ’s personalization Many interesting features Full-adaptivity to the variegate user needs/behaviors Scalable to many users and their unbounded profiles Privacy protection: no login or profiled information is required/used Black-box: no change in the underlying (unpersonalized) search engine The user first gets a “glimpse” on all themes of results, without scanning all of them, and then selects (filters) the results interesting for him/her.

Evaluating SnakeT Large set of experiments: 77 queries on 16 engines (2004) Meta-search is useful: The use of many search engines increases the number and distinctness of the extracted themes Text-anchors and DMOZ improve precision of the labels Gapped sentences improve the compaction of the hierarchy and the readability of the labels See the paper

Conclusions  Check it at: snaket.di.unipi.it  We are searching for: Linguistic tools that improve the label extraction process Re-ranking techniques that exploit the folders selected by the user during the personalization session Deal with click-through data for better personalization and re- ranking [WWW 05] Algorithmic approach to semantic similarity for better ranking based on DMOZ and other KBs [WWW 05]  Desktop search engine Patent pending !?!