Web IR: Recent Trends; Future of Web Search CSC 575 Intelligent Information Retrieval

Behavior-Based Ranking
- The emergence of large-scale search engines makes it possible to mine aggregate user behavior to improve ranking.
- Basic idea:
  - For each query Q, keep track of which docs in the results are clicked on.
  - On subsequent requests for Q, re-order the docs in the results based on click-throughs.
- Relevance assessment based on behavior/usage vs. content.

Query-Doc Popularity Matrix B
- Rows are queries q; columns are docs j.
- B_qj = number of times doc j was clicked through on query q.
- When query q is issued again, order the docs by their B_qj values.
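A minimal sketch of this click-through re-ranking, assuming a hypothetical in-memory matrix B keyed by query and doc id:

```python
from collections import defaultdict

# B[q][d] = number of times doc d was clicked through on query q
B = defaultdict(lambda: defaultdict(int))

def record_click(query, doc):
    """Log one click-through on doc for this query."""
    B[query][doc] += 1

def rerank(query, docs):
    """On a repeat of `query`, re-order candidate docs by click count.
    Python's sort is stable, so unclicked docs keep their original order."""
    return sorted(docs, key=lambda d: B[query][d], reverse=True)

record_click("jazz", "d2")
record_click("jazz", "d2")
record_click("jazz", "d1")
ranking = rerank("jazz", ["d1", "d2", "d3"])
```

In practice B would be huge and sparse, and would be maintained in batch from query logs rather than updated live.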

Vector Space Implementation
- Maintain a term-doc popularity matrix C (as opposed to query-doc popularity), initialized to all zeros.
- Each column C_j represents a doc j.
- If doc j is clicked on for query q, update C_j ← C_j + εq (here q is viewed as a term vector and ε is an update weight).
- On a new query q', compute its cosine proximity to C_j for all j.
- Combine this with the regular text score.
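A small sketch of this scheme, assuming the per-doc vectors are stored as term-to-weight counters and taking ε = 1 for simplicity:

```python
import math
from collections import Counter

C = {}  # C[j]: term -> accumulated click weight for doc j

def click_update(doc, query_terms):
    """On a click for doc under query q: C_j <- C_j + q (epsilon = 1 here)."""
    C.setdefault(doc, Counter()).update(query_terms)

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def behavior_score(doc, query_terms):
    """Cosine proximity of a new query q' to the doc's popularity vector C_j;
    the full system would combine this with the regular text score."""
    return cosine(Counter(query_terms), C.get(doc, Counter()))

click_update("d1", ["white", "house"])
score = behavior_score("d1", ["white"])
```

Because C is indexed by term rather than whole query, a click on "white house" also boosts the doc for the later query "white" (the compositionality assumption questioned on the next slide).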

Issues
- Normalization of C_j after updating.
- Assumption of query compositionality: "white house" document popularity is derived from "white" and "house".
- Updating: live or batch?
- Basic assumption: relevance can be directly measured by number of click-throughs. Valid?

"Information Retrieval"
- The name "information retrieval" is standard, but as traditionally practiced it's not really right.
- All you get is document retrieval; beyond that, the job is up to you.
- Which technologies will help us get closer to the real promise of IR?

Learning Interface Agents
- Add agents to the user interface and delegate tasks to them.
- Use machine learning to improve performance: learn user behavior and preferences.
- Useful when:
  1. past behavior is a useful predictor of future behavior, and
  2. there is a wide variety of behaviors among users.
- Examples:
  - mail clerk: sorts incoming messages into the right mailboxes
  - calendar manager: automatically schedules meeting times?
  - personal news agents
  - portfolio manager agents
- Advantages:
  - less work for user and application writer
  - adaptive behavior
  - user and agent build a trust relationship gradually

Letizia: Autonomous Interface Agent (Lieberman 96)
- Recommends web pages during browsing, based on a user profile.
- Learns the user profile using simple heuristics.
- Passive observation; recommends on request.
- Provides a relative ordering of link interestingness.
- Assumes recommendations "near" the current page are more valuable than others.

Letizia: Autonomous Interface Agent
- Infers user preferences from behavior.
- Signals that a page is interesting:
  - recording it in the hot list (saving it as a file)
  - following several links from the page
  - returning several times to the document
- Signals that a page is not interesting:
  - spending only a short time on the document
  - returning to the previous document without following links
  - passing over a link to the document (selecting links above and below it)
- Why is this useful?
  - tracks and learns user behavior, providing user "context" to the application (browsing)
  - completely passive: no work for the user
  - useful when the user doesn't know where to go
  - no modifications to the application: Letizia interposes between the Web and the browser
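These heuristics can be sketched as a toy scoring function; the field names and weights below are hypothetical, not taken from the Letizia paper:

```python
def interest_score(page):
    """Letizia-style passive interest heuristics (illustrative weights)."""
    score = 0.0
    if page.get("saved_to_hot_list"):              # recorded in hot list
        score += 2.0
    score += 0.5 * page.get("links_followed", 0)   # followed several links
    score += 0.5 * page.get("return_visits", 0)    # returned several times
    dwell = page.get("dwell_seconds")
    if dwell is not None and dwell < 5:            # short time on document
        score -= 1.0
    if page.get("link_passed_over"):               # nearby links chosen instead
        score -= 0.5
    return score

hot = interest_score({"saved_to_hot_list": True, "links_followed": 2})
cold = interest_score({"dwell_seconds": 2, "link_passed_over": True})
```

The point of the sketch is that every input is observable without asking the user anything, which is exactly what makes the heuristics weak (next slide).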

Consequences of Passiveness
- Weak heuristics, for example:
  - clicking through multiple uninteresting pages en route to an interesting one
  - browsing to an uninteresting page, then going for a coffee
  - hierarchies tending to get more hits near the root
- Cold start.
- No ability to fine-tune the profile or express interest without visiting "appropriate" pages.
- Some possible alternatives/extensions to internally maintained profiles:
  - expose the profile to the user (e.g., to fine-tune it)?
  - expose it to other users/agents (e.g., collaborative filtering)?
  - expose it to the web server (e.g., cnn.com custom news)?

ARCH: Adaptive Agent for Retrieval Based on Concept Hierarchies (Mobasher, Sieg, Burke)
- ARCH supports users in formulating effective search queries, starting from their poorly designed keyword queries.
- The essence of the system is to combine domain-specific concept hierarchies with interactive query formulation.
- Query enhancement in ARCH uses two mutually supporting techniques:
  - Semantic: using a concept hierarchy to interactively disambiguate and expand queries
  - Behavioral: observing the user's past browsing behavior for user profiling and automatic query enhancement

Overview of ARCH
- The system consists of an offline and an online component.
- Offline component:
  - learns the concept hierarchy
  - learns the user profiles
- Online component:
  - displays the concept hierarchy to the user
  - allows the user to select/deselect nodes
  - generates the enhanced query based on the user's interaction with the concept hierarchy

Offline Component: Learning the Concept Hierarchy
- Maintain an aggregate representation of the concept hierarchy: pre-compute the term vector for each node in the hierarchy.
- Concept classification hierarchy: Yahoo.

Aggregate Representation of Nodes in the Hierarchy
- A node is represented as a weighted term vector: the centroid of all documents and subcategories indexed under the node, where:
  - n = a node in the concept hierarchy
  - D_n = the collection of individual documents indexed under n
  - S_n = the subcategories under n
  - T_d = the weighted term vector for document d indexed under n
  - T_s = the term vector for subcategory s of n
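A centroid formula consistent with these definitions (treating documents and subcategories with equal weight, which is an assumption; the original system may weight them differently) is:

```latex
T_n \;=\; \frac{1}{|D_n| + |S_n|}\left(\sum_{d \in D_n} T_d \;+\; \sum_{s \in S_n} T_s\right)
```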

Example from the Yahoo Hierarchy

Online Component: User Interaction with the Hierarchy
- The initial user query is mapped to the relevant portions of the hierarchy:
  - the user enters a keyword query
  - the system matches the term vectors representing each node in the hierarchy against the keyword query
  - nodes that exceed a similarity threshold are displayed to the user, along with other adjacent nodes
- Semi-automatic derivation of user context:
  - an ambiguous keyword might cause the system to display several different portions of the hierarchy
  - the user selects categories that are relevant to the intended query and deselects categories that are not

Generating the Enhanced Query
- Based on an adaptation of Rocchio's method for relevance feedback.
- Using the selected and deselected nodes, the system produces a refined query Q_2, where:
  - each T_sel is the term vector for one of the nodes selected by the user
  - each T_desel is the term vector for one of the deselected nodes
  - the factors α, β, and γ are tuning parameters representing the relative weights of the initial query, positive feedback, and negative feedback, respectively, such that α + β - γ = 1.
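A sketch of the refinement step, assuming the standard Rocchio form Q2 = α·Q1 + β·ΣT_sel - γ·ΣT_desel (the slide gives only the variable definitions, so the exact combination and the default weights below are assumptions):

```python
from collections import Counter

def enhanced_query(q1, selected, deselected, alpha=0.6, beta=0.5, gamma=0.1):
    """Q2 = alpha*Q1 + beta*sum(T_sel) - gamma*sum(T_desel).
    Default weights are arbitrary, chosen so alpha + beta - gamma = 1."""
    q2 = Counter()
    for term, w in q1.items():
        q2[term] += alpha * w
    for t_sel in selected:
        for term, w in t_sel.items():
            q2[term] += beta * w
    for t_desel in deselected:
        for term, w in t_desel.items():
            q2[term] -= gamma * w
    # negative feedback can push terms to or below zero; drop those
    return {term: w for term, w in q2.items() if w > 0}

q1 = {"music": 1.0, "jazz": 1.0}
selected = [{"music": 1.0, "jazz": 0.44, "dixieland": 0.20}]
deselected = [{"music": 0.9, "blues": 0.8}]
q2 = enhanced_query(q1, selected, deselected)
```

Note how "blues" is removed entirely by negative feedback while "dixieland", absent from the original query, enters the enhanced query via the selected node.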

An Example
- (Figure: portion of the hierarchy showing Music, Genres, Artists, New Releases, Blues, Jazz, New Age, and Dixieland.)
- Initial query: "music, jazz"
- Selected categories: "Music", "Jazz", "Dixieland"
- Deselected category: "Blues"
- Portion of the resulting term vector: music: 1.00, jazz: 0.44, dixieland: 0.20, tradition: 0.11, band: 0.10, inform: 0.10, new: 0.07, artist: 0.06

Another Example: the ARCH Interface
- Initial query: "python"
- Search intent: python as a snake
- The user selects "Pythons" under "Reptiles".
- The user deselects "Python" under "Programming and Development" and "Monty Python" under "Entertainment".
- Enhanced query: (resulting term vector shown on the slide)

Generation of User Profiles
- Profile generation component of ARCH:
  - passively observes the user's browsing behavior
  - uses heuristics to determine which pages the user finds "interesting":
    - time spent on the page (or similar pages)
    - frequency of visits to the page or the site
    - other factors, e.g., bookmarking a page
  - implemented as a client-side proxy server
- Clustering of "interesting" documents:
  - ARCH extracts a feature vector for each profile document
  - documents are clustered into semantically related categories
  - a clustering algorithm that supports overlapping categories is used, to capture relationships across clusters
  - algorithms: an overlapping version of k-means; hypergraph partitioning
  - profiles are the significant features in the centroid of each cluster
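The last step, extracting a profile as the significant features of a cluster centroid, can be sketched as follows (the clustering itself is assumed to be done already; the top_k cut-off is a hypothetical choice, not from the slides):

```python
from collections import Counter

def profile_from_cluster(doc_vectors, top_k=3):
    """Profile = the heaviest features of the cluster centroid."""
    centroid = Counter()
    for vec in doc_vectors:
        centroid.update(vec)            # sum the sparse term vectors
    n = len(doc_vectors)
    averaged = {t: w / n for t, w in centroid.items()}
    # keep only the top_k most significant terms as the profile
    return dict(sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

cluster = [{"python": 1.0, "snake": 0.5}, {"python": 0.8, "reptile": 0.6}]
profile = profile_from_cluster(cluster, top_k=2)
```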

User Profiles & Information Context
- Can user profiles replace the need for user interaction?
- Instead of explicit user feedback, the user profiles are used for the selection and deselection of concepts:
  - each individual profile is compared to the original user query for similarity
  - profiles that satisfy a similarity threshold are then compared to the matching nodes in the concept hierarchy (the nodes that exceeded a similarity threshold against the user's original keyword query)
  - the node with the highest similarity score is used for automatic selection; nodes with relatively low similarity scores are used for automatic deselection

Results Based on User Profiles

New Paradigms for Search: Social / Collaborative Tags

Example: Tags Describe the Resource
- Tags can describe:
  - the resource itself (genre, actors, etc.)
  - organization (toRead)
  - subjective judgments (awesome)
  - ownership (abc)
  - etc.

Tag Recommendation

Tags Describe the User
- These systems are "collaborative": recommendation and analytics are based on the "wisdom of crowds".
- Example: Rai Aren's profile, co-author of "Secret of the Sands".

Example: Using Tags for Recommendation

New Paradigms for Search: Social Recommendation
- A form of collaborative filtering using social network data.
- User profiles are represented as sets of links to other nodes (users or items) in the network.
- Prediction problem: infer a currently non-existent link in the network.
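A minimal sketch of this link-prediction view, scoring candidate edges by common neighbors (the toy graph and node names are made up; real systems use much richer features):

```python
# Toy social graph as adjacency sets: nodes are users or items
graph = {
    "alice": {"bob", "carol", "item1"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "item1"},
    "dave":  {"carol"},
}

def common_neighbors(g, u, v):
    """Score a potential (currently non-existent) edge u-v by shared neighbors."""
    return len(g.get(u, set()) & g.get(v, set()))

def predict_links(g, user):
    """Rank nodes not yet linked to `user` as candidate new links."""
    candidates = [n for n in g if n != user and n not in g[user]]
    return sorted(candidates, key=lambda v: common_neighbors(g, user, v), reverse=True)
```

For example, "bob" and "dave" share the neighbor "carol", so dave is predicted as a likely new link for bob.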

What’s been happening now? (Google)  “Mobilegeddon” (Apr 21, 2015):  “Mobile friendliness” as a major ranking signal  “Pigeon” update (July 2014):  For U.S. English results, the “Pigeon Update” is a new algorithm to provide more useful, relevant and accurate local search results that are tied more closely to traditional web search ranking signals  More use of distance and location in ranking signals  “App Indexing” (Android, iOS support May 2015)  Search results can take you to an app  Why?  About half of all searches are now from mobile  Making/wanting good changes, but obvious self-interest in trying to keep people using mobile web rather than apps

What’s been happening now? (Google)  New search index at Google: “Hummingbird”  revamped-search-to-handle-your-long-questions/ revamped-search-to-handle-your-long-questions/  Answering long, “natural language” questions better  Partly to deal with spoken queries on mobile  More use of the Google Knowledge Graph  Concepts versus words

What’s been happening now?  Move to mobile favors a move to speech which favors “natural language information search”  Will we move to a time when over half of searches are spoken?

3 Approaches to Question Answering: Knowledge-Based (Siri)
- Build a semantic representation of the query: times, dates, locations, entities, numeric quantities.
- Map from this semantics to structured data or resources:
  - geospatial databases
  - ontologies (Wikipedia infoboxes, dbPedia, WordNet, Yago)
  - restaurant review sources and reservation services
  - scientific databases
  - Wolfram Alpha

Text-Based (Mainly Factoid) QA
- Question processing:
  - detect question type, answer type, focus, and relations
  - formulate queries to send to a search engine
- Passage retrieval:
  - retrieve ranked documents
  - break them into suitable passages and re-rank
- Answer processing:
  - extract candidate answers (as named entities)
  - rank candidates using evidence from relations in the text and from external sources
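The question-processing step can be sketched with toy heuristics; the type rules and stopword list below are purely illustrative, not from the slides:

```python
import re

def question_type(question):
    """Crude answer-type detection from the wh-word (illustrative heuristic)."""
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("where"):
        return "LOCATION"
    if q.startswith("when"):
        return "DATE"
    return "OTHER"

def formulate_query(question):
    """Drop wh-words and stopwords to form the query sent to the search engine."""
    stop = {"who", "what", "where", "when", "why", "how",
            "is", "was", "did", "the", "a", "an"}
    return [w for w in re.findall(r"\w+", question.lower()) if w not in stop]

qtype = question_type("Who founded CSIRO?")
query = formulate_query("Who founded CSIRO?")
```

The detected answer type ("PERSON" here) is what later lets answer processing keep only named entities of the right kind among the retrieved passages.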

Hybrid Approaches (IBM Watson)
- Build a shallow semantic representation of the query.
- Generate answer candidates using IR methods, augmented with ontologies and semi-structured data.
- Score each candidate using richer knowledge sources:
  - geospatial databases
  - temporal reasoning
  - taxonomic classification

What’s been happening now?  Google Knowledge Graph  Facebook Graph Search  Bing’s Satori  Things like Wolfram Alpha Common theme: Doing graph search over structured knowledge rather than traditional text search  Two Goals:  Things not strings  Inference not search

Example from Fernando Pereira (Google)

Direct Answer Structured Data

Desired experience: Towards actions (Patrick Pantel, et al., Microsoft Research)

Desired experience: Towards actions

Actions vs. Intents

Learning actions from web usage logs

The Facebook Graph
- A collection of entities and their relationships:
  - entities (users, pages, photos, etc.) are nodes
  - relationships (friendship, checkins, tagging, etc.) are edges
- Nodes and edges have metadata.
- Each node has a unique id: the fbid.

Facebook Graph Search

Facebook Graph Snippet
- Example PAGE node: fbid: ..., type: PAGE, name: Breville, mission: "To design the best ..."
- Example USER node: fbid: ..., type: USER, name: Sriram Sankar
- Edge types shown: LIKES, FRIEND, PHOTO, TAGGED, EVENT

Social search/QA

Facebook Graph Search  Uses a weighted context free grammar (WCFG) to represent the Graph Search query language:  [start] => [users] $1  [users] => my friend friends(me)  [users] => friends of [users] friends($1)  [users] => {user} $1  [start] => [photos] $1  [photos] => photos of [users]photos($1)  A terminal symbol can be an entity, e.g., {user}, {city}, {employer}, {group}; it can also be a word/phrase, e.g., friends, live in, work at, members, etc. A parse tree is produced by starting from [start] and expanding the production rules until it reaches terminal symbols.