
Dragon Star Program Course: Information Retrieval (龙星计划课程: 信息检索)
Next-Generation Search Engines
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois at Urbana-Champaign
Dragon Star Lecture at Beijing University, June 21-30, 2008 © ChengXiang Zhai

Outline
– Overview of web search
– Next-generation search engines

Characteristics of Web Information
“Infinite” size (surface vs. deep Web)
– Surface = static HTML pages
– Deep = dynamically generated HTML pages (database-backed)
Semi-structured
– Structured = HTML tags, hyperlinks, etc.
– Unstructured = text
Different formats (PDF, Word, PS, …)
Multi-media (text, audio, images, …)
High variance in quality (much junk)
“Universal” coverage (can be about any content)

General Challenges in Web Information Management
Handling the size of the Web
– How to ensure completeness of coverage?
– Efficiency issues
Dealing with or tolerating errors and low-quality information
Addressing the dynamics of the Web
– Some pages may disappear permanently
– New pages are constantly created

“Free Text” vs. “Structured Text”
So far, we’ve assumed “free text”
– Document = word sequence
– Query = word sequence
– Collection = a set of documents
– Minimal structure…
But we may have structure on text (e.g., titles, hyperlinks)
– Can we exploit the structure in retrieval?

Examples of Document Structures
Intra-document structures (= relations among components)
– Natural components: title, author, abstract, sections, references, …
– Annotations: named entities, subtopics, markups, …
Inter-document structures (= relations between documents)
– Topic hierarchy
– Hyperlinks/citations (hypertext)

Structured Text Collection
[Figure: a collection of documents organized under a general topic with subtopics 1 … k]
General question: how do we search such a collection?

Exploiting Intra-document Structures [Ogilvie & Callan 2003]
[Figure: a document D decomposed into parts D_1 … D_k, e.g., title, abstract, body part 1, body part 2, …]
Intuitively, we want to combine all the parts, but give more weight to some parts than others
Think about the query-likelihood model…
– To generate a query word, first select a part D_j, then generate the word from D_j
– The “part selection” probability serves as the weight for D_j
– It can be trained using EM
Anchor text can be treated as a “part” of a document
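Written out, the mixture model this slide alludes to looks roughly as follows; this is a sketch in the style of Ogilvie & Callan 2003, with the notation p(D_j|D) for the part-selection probability assumed here rather than taken from the slide.

```latex
% Query likelihood over a structured document D split into parts D_1, ..., D_k
% (title, abstract, body sections, anchor text, ...):
\[
p(Q \mid D) \;=\; \prod_{w \in Q} \; \sum_{j=1}^{k}
  \underbrace{p(D_j \mid D)}_{\text{part-selection weight}} \; p(w \mid D_j)
\]
% The part-selection probabilities act as field weights and, as the slide
% suggests, can be estimated with EM from training queries.
```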

Exploiting Inter-document Structures
The document collection has links (e.g., the Web, citations in the literature)
Query: text query
Results: ranked list of documents
Challenge: how to exploit links to improve ranking?

Exploiting Inter-document Links: What Does a Link Tell Us?
– The link’s description (“anchor text”) provides extra text / a summary for the target document
– Links indicate the utility of a document
– Link structure distinguishes “hubs” (pages pointing to many good pages) from “authorities” (pages pointed to by many good pages)

PageRank: Capturing Page “Popularity” [Page & Brin 98]
Intuitions
– Links are like citations in the literature
– A page that is cited often can be expected to be more useful in general
PageRank is essentially “citation counting”, but improves over simple counting
– Considers “indirect citations” (being cited by a highly cited page counts for a lot…)
– Smooths citation counts (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)

The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
– with probability α, randomly jump to a page
– with probability (1 − α), randomly pick a link to follow
Let N = # pages, M the “transition matrix” of the link graph (M_ij = 1/k if page d_i has k outlinks, one of which points to d_j), and I_ij = 1/N the uniform jump matrix
Iterate until convergence, with initial value p(d) = 1/N:
  p_{t+1}(d_j) = Σ_i [ α · (1/N) + (1 − α) · M_ij ] · p_t(d_i)
At convergence, p is the stationary (“stable”) distribution, so we ignore time
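A minimal Python sketch of this random-surfer iteration; the toy link graph is invented for illustration, and dangling pages (no outlinks) are handled by jumping uniformly, which is one common convention rather than anything stated on the slide.

```python
import numpy as np

def pagerank(links, alpha=0.15, tol=1e-10):
    """links[i] = list of pages that page i points to."""
    n = len(links)
    p = np.full(n, 1.0 / n)                 # initial value p(d) = 1/N
    while True:
        new_p = np.zeros(n)
        for i, out in enumerate(links):
            if out:                          # follow a random outlink with prob. (1 - alpha)
                for j in out:
                    new_p[j] += (1 - alpha) * p[i] / len(out)
            else:                            # dangling page: jump uniformly (assumed fix)
                new_p += (1 - alpha) * p[i] / n
        new_p += alpha / n                   # random jump with prob. alpha
        if np.abs(new_p - p).sum() < tol:
            return new_p
        p = new_p

# Toy 4-page graph: d0 -> d1, d2;  d1 -> d3;  d2 -> d3;  d3 -> d0
print(pagerank([[1, 2], [3], [3], [0]]))
```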

PageRank in Practice
Interpretation of the damping factor α (≈ 0.15):
– probability of a random jump
– smoothing of the transition matrix (avoids zeros)
Normalization doesn’t affect the ranking, which leads to some variants of the formula
The zero-outlink problem: the p(d_i)’s don’t sum to 1
– One possible solution: a page-specific damping factor (α = 1.0 for a page with no outlinks)

HITS: Capturing Authorities & Hubs [Kleinberg 98]
Intuitions
– Pages that are widely cited are good authorities
– Pages that cite many other pages are good hubs
The key idea of HITS
– Good authorities are cited by good hubs
– Good hubs point to good authorities
– Iterative reinforcement…

The HITS Algorithm [Kleinberg 98]
Let A be the “adjacency matrix” of the link graph (A_ij = 1 if d_i links to d_j)
Initial values: a(d_i) = h(d_i) = 1
Iterate:
  a(d_i) = Σ_{d_j → d_i} h(d_j)   (in matrix form: a = A^T h)
  h(d_i) = Σ_{d_i → d_j} a(d_j)   (in matrix form: h = A a)
Normalize a and h after each iteration
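A small Python sketch of the hub/authority reinforcement described above; the toy adjacency matrix and the fixed iteration count are illustrative only.

```python
import numpy as np

def hits(A, iters=50):
    """A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    a = np.ones(n)                 # authority scores, initial value 1
    h = np.ones(n)                 # hub scores, initial value 1
    for _ in range(iters):
        a = A.T @ h                # good authorities are cited by good hubs
        h = A @ a                  # good hubs point to good authorities
        a /= np.linalg.norm(a)     # normalize after each iteration
        h /= np.linalg.norm(h)
    return a, h

# Toy graph: d0 and d1 (hubs) both point to d2 and d3 (authorities)
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
print(hits(A))
```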

Basic Search Engine Technologies
[Figure: pipeline from the Web through the Crawler (cached pages) to the Indexer (inverted index) to the Retriever, which serves results to the user’s browser for a query]
Key concerns: efficiency, coverage, freshness, precision, error/spam handling

Component I: Crawler/Spider/Robot
Building a “toy crawler” is easy
– Start with a set of “seed pages” in a priority queue
– Fetch pages from the Web
– Parse the fetched pages for hyperlinks; add them to the queue
– Follow the hyperlinks in the queue
A real crawler is much more complicated…
– Robustness (server failures, traps, etc.)
– Crawling courtesy (server load balancing, robot exclusion, etc.)
– Handling file types (images, PDF files, etc.)
– URL extensions (CGI scripts, internal references, etc.)
– Recognizing redundant pages (identical pages and duplicates)
– Discovering “hidden” URLs (e.g., truncated)
Crawling strategy (i.e., which page to visit next?) is a major research topic
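A toy breadth-first crawler in that spirit, using a plain FIFO queue rather than a priority queue; it deliberately ignores the robustness, courtesy, and duplicate-detection issues listed above, and the seed URL is a placeholder.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def toy_crawl(seeds, max_pages=20):
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                           # breadth-first: FIFO queue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                                    # a real crawler needs far better error handling
        pages[url] = html
        for link in re.findall(r'href="(.*?)"', html):  # crude hyperlink extraction
            link = urljoin(url, link)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# pages = toy_crawl(["https://example.com/"])           # seed URL is illustrative
```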

Major Crawling Strategies
Breadth-first (most common(?); balances server load)
Parallel crawling
Focused crawling
– Targets a subset of pages (e.g., all pages about “automobiles”)
– Typically given a query
Incremental/repeated crawling
– Can learn from past experience
– Probabilistic models are possible
The major challenge remains maintaining “freshness” and good coverage with minimal resource overhead

Component II: Indexer
Standard IR techniques are the basis
– Basic indexing decisions (stop words, stemming, numbers, special symbols)
– Indexing efficiency (space and time)
– Updating
Additional challenges
– Recognizing spam/junk
– Exploiting multiple features (PageRank, font information, structure, etc.)
– How to support fast summary generation?
Google’s contributions:
– Google File System: a distributed file system
– Bigtable: a column-oriented database
– MapReduce: a software framework for parallel computation
– Hadoop: an open-source implementation of MapReduce (driven mainly by Yahoo!)
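For the core indexing step, a minimal inverted-index sketch; stopping, stemming, and compression are omitted, and the tokenizer and sample documents are deliberately naive placeholders.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, term_freq), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in re.findall(r"[a-z0-9]+", text.lower()):   # naive tokenization
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))                  # posting: (document, term frequency)
    return index

index = build_inverted_index({1: "web search engines", 2: "web crawling and indexing"})
print(index["web"])        # [(1, 1), (2, 1)]
```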

Google’s Basic Solutions
[Figure: main data structures in Google’s original architecture]
– URL queue/list
– Cached source pages (compressed)
– Inverted index
– Hypertext structure
– Use of many features, e.g., font, layout, …

Component III: Retriever
Standard IR models are applicable but insufficient
– Different information needs (home-page finding vs. topic-driven search)
– Documents carry additional information (hyperlinks, markup, URLs)
– Information is often redundant and quality varies a lot
– Server-side feedback is often not feasible
Major extensions
– Exploiting links (anchor text, link-based scoring)
– Exploiting layout/markup (font, the title field, etc.)
– Spelling correction
– Spam filtering
– Redundancy elimination
In general, rely on machine learning to combine all kinds of features

Effective Web Retrieval Heuristics
High accuracy in home-page finding can be achieved by
– Matching the query against the title
– Matching the query against the anchor text
– Plus URL-based or link-based scoring (e.g., PageRank)
Imposing a conjunctive (“AND”) interpretation of the query is often appropriate
– Queries are generally very short (all words are necessary)
– The size of the Web makes it likely that at least one page matches all the query words
Combine multiple features using machine learning

Home/Entry Page Finding Evaluation Results (TREC 2001)
Query example: Haas Business School
[Table: MRR, % in top 10, and % failed for the compared runs; the numeric values are not recoverable from this transcript]
Approach highlighted: unigram query likelihood + link/URL prior, i.e., p(Q|D)·p(D) [Kraaij et al. SIGIR 2002], exploiting anchor text, structure, or links
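Written out, the scoring rule this slide refers to; the concrete examples of the prior given in the comments are illustrative, not a description of Kraaij et al.'s exact priors.

```latex
% Home-page finding as query likelihood with a document prior:
\[
\text{score}(Q, D) \;=\; p(D \mid Q) \;\propto\; p(Q \mid D)\, p(D)
\;=\; \Big(\prod_{w \in Q} p(w \mid \theta_D)\Big)\, p(D),
\]
% where p(D) encodes link/URL evidence, e.g., a higher prior for root URLs
% ("www.site.com/") or for pages with many inlinks.
```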

Named Page Finding Evaluation Results (TREC 2002)
Query example: America’s century farms
[Chart: compared runs, including Dirichlet prior + title and anchor text (Lemur) [Ogilvie & Callan SIGIR 2003], Okapi/BM25 + anchor text, and the best content-only run; the numeric values are not recoverable from this transcript]

Learning Retrieval Functions
Basic idea:
– Given a query-document pair (Q, D), define various features F_i(Q, D)
– Example features: the number of overlapping terms, p(Q|D), the PageRank of D, p(Q|D_i) where D_i may be the anchor text or big-font text
– Hypothesize p(R = 1 | Q, D) = s(F_1(Q, D), …, F_n(Q, D); θ), where θ is a set of parameters
– Learn θ by fitting the function s to training data (i.e., (Q, D) pairs where D is known to be relevant or non-relevant to Q)
Methods:
– Early work: logistic regression [Cooper 92, Gey 94]
– Recent work: Ranking SVM [Joachims 02], RankNet [Burges et al. 05]
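A pointwise sketch in the spirit of the early logistic-regression work cited above; the three features and the tiny training set are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds features F_i(Q, D) for one query-document pair,
# e.g., [# overlapping terms, log p(Q|D), PageRank of D]  (feature choice is illustrative)
X = np.array([[3, -2.1, 0.8],
              [0, -6.3, 0.9],
              [2, -3.0, 0.1],
              [1, -4.5, 0.2]])
y = np.array([1, 0, 1, 0])                      # past relevance judgments: relevant / non-relevant

model = LogisticRegression().fit(X, y)          # learn s(F_1, ..., F_n; theta)
print(model.predict_proba(X)[:, 1])             # estimates of p(R = 1 | Q, D)
```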

Learning to Rank
Advantages
– Can combine multiple features (helps improve accuracy and combat web spam)
– Can re-use all past relevance judgments (self-improving)
Problems
– Not much guidance on feature generation (still relies on traditional retrieval models)
All current Web search engines use some kind of learning algorithm to combine many features

Next-Generation Search Engines

Limitations of Current Search Engines
Limited query language
– Syntactic querying (no sense disambiguation)
– Cannot express multiple search criteria (e.g., readability)
Limited understanding of document contents
– Bag of words & keyword matching (no sense disambiguation)
Heuristic query-document matching: mostly TF-IDF weighting
– No guarantee of optimality
– Machine learning can combine many features, but content matching remains the most important component of scoring

Limitations of Current Search Engines (cont.)
Lack of user/context modeling
– With the same query, different users get the same results (poor user modeling)
– The same user may issue the same query to find different information at different times
Inadequate support for multiple modes of information access
– Passive search support: the user must take the initiative (no recommendation)
– Static navigation support: no dynamically generated links
– Should consider tighter integration of search, recommendation, and navigation
Lack of interaction
Lack of task support

Towards Next-Generation Search Engines
Better support for query formulation
– Allow querying from any task context
– Query by examples
– Automatic query generation (recommendation)
Better search accuracy
– More accurate understanding of the information need (more personalization and context modeling)
– More accurate understanding of document content (more powerful content analysis, sense disambiguation, sentiment analysis, …)
More complex retrieval criteria
– Consider multiple utility aspects of information items (e.g., readability, quality, communication cost)
– Consider the collective value of information items (context-sensitive ranking)

Towards Next-Generation Search Engines (cont.)
Better result presentation
– Better organization of search results to facilitate navigation
– Better summarization
More effective and robust retrieval models
– Automatic parameter tuning
More interactive search
More task support

Looking Ahead…
More user modeling
– Personalized search
– Community search engines (collaborative information access)
More content analysis and domain modeling
– Vertical search engines
– More in-depth (domain-specific) natural language understanding
– Text mining
More accurate retrieval models (lifetime learning)
Going beyond search
– Towards full-fledged information access: integration of search, recommendation, and navigation
– Towards task support: putting information access in the context of a task

Summary
The Web provides many challenges and opportunities for text information management
Search engine technology ≈ crawling + retrieval models + machine learning + software engineering
The current generation of search engines is limited in user modeling, content understanding, and retrieval models
The next generation of search engines will likely move toward personalization, domain-specific vertical search engines, collaborative search, task support, …

What You Should Know
Special characteristics of Web information (compared with an ordinary text collection)
The two kinds of structure in a text collection (intra-document and inter-document)
The basic ideas of PageRank and HITS
How a web search engine works
The limitations of current search engines