Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

XML DOCUMENTS AND DATABASES
Chapter 5: Introduction to Information Retrieval
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.
Information Retrieval in Practice
Search Engines and Information Retrieval
Architecture of a Search Engine
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Information Retrieval in Practice
Web Algorithmics Web Search Engines. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, , blog, e-book,... Query.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Information Retrieval
Search engines fdm 20c introduction to digital media lecture warren sack / film & digital media department / university of california, santa.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Lecture 1: Web Search Overview & Web Crawling
1. Search Engines Architecture Azreen Azman, PhD SMM 5891 All slides ©Addison Wesley, 2008.
Search Engines and Information Retrieval Chapter 1.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
5 - 1 Copyright © 2006, The McGraw-Hill Companies, Inc. All rights reserved.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
ITGS Databases.
Keyword Searching Weighted Federated Search with Key Word in Context Date: 10/2/2008 Dan McCreary President Dan McCreary & Associates
Query Suggestion. n A variety of automatic or semi-automatic query suggestion techniques have been developed  Goal is to improve effectiveness by matching.
Vector Space Models.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
A search engine is a web site that collects and organizes content from all over the internet Search engines look through their own databases of.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
©2003 Paula Matuszek GOOGLE API l Search requests: submit a query string and a set of parameters to the Google Web APIs service and receive in return a.
General Architecture of Retrieval Systems 1Adrienn Skrop.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Information Retrieval in Practice
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Information Retrieval (in Practice)
Clustering of Web pages
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Implementation Issues & IR Systems
Computer Literacy BASICS: A Comprehensive Guide to IC3, 3rd Edition
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Introduction to Search Engines
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Presentation transcript:

Chapter 2 Architecture of a Search Engine

Search Engine Architecture n A software architecture consists of software components, the interfaces provided by those components and the relationships between them  Describes a system at a particular level of abstraction n Architecture of a search engine determined by two requirements  Effectiveness (quality of results)  Efficiency (response time and throughput) 2

Indexing Process 3 Identifies and stores documents for indexing Transforms documents into index terms or features Takes index terms and creates data structures (inverted indexes) to support fast searching Text + Meta data (Doc type, structure, features, size, etc.) - One of the two major functions of search engine components

Query Process 4 Supports creation/refinement of query, display of results Uses query and indexes to generate ranked list of documents Monitors and measures effectiveness and efficiency (primarily offline) using log data - Another major function of search engine components  Must be both efficient and effective

Details: Text Acquisition n Crawler  Identifies and acquires documents for search engines  Many types – Web, enterprise, desktop  Web crawlers follow links to find documents Must efficiently find huge numbers of web pages ( ) and keep them up-to-date ( ) Single site crawlers for site search Topical or focused crawlers for specific search  Document crawlers for enterprise and desktop search Follow links and scan directories 5 coverage freshness

Text Acquisition n Feeds  Real-time streams of documents e.g., Web feeds for news, blogs, video, radio, TV  RSS (Rich Site Summary) is a commonly-used web feed format (which has been standardized) n Conversion  Convert variety of documents into a consistent text plus metadata format e.g., HTML, Word, PDF, etc. → XML  Convert text encoding for different languages Using a Unicode standard like UTF-8 6

Text Transformation n Parser  Processing the sequence of text tokens (i.e., words) in the document to recognize structural elements e.g., titles, links, headings, etc.  Tokenizer recognizes “words” in the text (and queries) for comparison, a non-trivial process. Must consider issues like,,,,, etc.  Markup languages such as HTML and XML often used to specify structure Tags used to specify document elements, e.g., Overview Document parser uses syntax of markup language (or other formatting) to identify structure 7 capitalization hyphensapostrophes non-alpha characters separators

Text Transformation n Stopping  Remove common (function) words, e.g., “and”, “or”, “the”, “in”  Some impact on efficiency & effectiveness (reduce the size of indexes)  A problem for some queries, e.g., “to be or not to be” n Stemming  Group words derived from a common stem, e.g., “compute”, “computer”, “computers”, “computing”  Often effective (in terms of matching); not for all queries  Benefits vary for different languages (Arabic vs. Chinese) n Information Extraction  Identify classes of index terms, e.g., named entity recognizers, identify classes such as people, locations, companies & dates, using part-of-speech tagging 8

Index Creation n Document Statistics (collected during the indexing process)  Gathers word counts and positions of words and other features (e.g., length of documents as number of tokens)  Used in ranking algorithm (IR model dependent)  Stored in lookup tables for fast retrieval n Weighting (during the query process)  Computes weights (the relative importance) of index terms  Used in ranking algorithm (IR model dependent)  e.g., TF-IDF weight Combination of term frequency (TF) in document and inverse document frequency (IDF) in the collection 9

10 Index Creation n Inversion of word list, converting doc-term to term-doc Sort ab2 being2 charact 2 human2 index1 literat1 novel1 pap1 report1 report2 result1 technique1 : : Word Doc# Remove Duplicates ab2 1 being2 1 charact 21 human21 index11 literat11 novel11 pap11 report11 21 result11 technique12 : : Word Doc# Freq Word Doc# pap 1 report1 novel1 technique1 literat1 result1 technique1 index1 :: report2 charact2 human2 being2 ab2 : :

11 Term-Document Incidence Matrix n Matrix element (t, d) = 1, if term t in document d; 0, otherwise n Example. Documents Terms n Term-Term Correlation Matrix: M  M T, where M is a term- document matrix, M T is the transpose of M, and ‘  ’ is the matrix composition operator

Index Creation n Inversion  Core of indexing process  Converts document-term information to term-document for indexing Difficult for very large numbers of documents to achieve high efficiency (for initial setup and subsequent updates) Multiple-level indexing is desirable for very large number of indexes, e.g., B + -tree indexing  Format of inverted file is designed for fast query processing Must also handle updates, besides creation Compression used for efficiency 12

User Interaction n Query input  Provides user interface and parser for query language  Most web queries are very simple, such as keyword queries, other applications may use forms  Query language used to describe more complex queries and results of query transformation Boolean queries “Quotes” for phrase queries, indicating relationships among words For keyword searches, longer queries yield less results Similar to SQL language used in DB applications IR query languages focus on content  Goal: yields good (better) results for a range of (specific) queries 13

User Interaction n Query transformation  Performs text transformation on query text, e.g., stemming  Improves initial query, both before and after initial search  Spell checking/query suggestion, which provide alternatives (correcting spelling errors/specification) to the original query, is based on query logs  Modify the original query with additional terms Query expansion: provides new, similar terms to a query based on term occurrences in documents or query logs Relevance feedback: terms in previous retrieved relevant documents 14

User Interaction n Results output  Constructs the display of ranked documents for a query  Generates snippets to show how queries match documents  Highlights important words and passages  May provide clustering and other visualization tools 15

Ranking n Scoring  Calculates scores for documents using a ranking algorithm  Is a core component of search engine  Basic form of score is  q i d i where V is the vocabulary of the document collection q i & d i are query and document term weights, respectively, e.g., TF/IDF or term probability for term i  Many variations of ranking algorithms and retrieval models  Must be calculated very rapidly to achieve performance optimization 16 |V| i=1