Architecture of a Search Engine

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

XML DOCUMENTS AND DATABASES
Chapter 5: Introduction to Information Retrieval
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Information Retrieval in Practice
Search Engines and Information Retrieval
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,
Information Retrieval in Practice
Web Algorithmics Web Search Engines. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, , blog, e-book,... Query.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Document and Query Forms Chapter 2. 2 Document & Query Forms Q 1. What is a document? A document is a stored data record in any form A document is a stored.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Information Retrieval
Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.
Search engines fdm 20c introduction to digital media lecture warren sack / film & digital media department / university of california, santa.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Lecture 1: Web Search Overview & Web Crawling
1. Search Engines Architecture Azreen Azman, PhD SMM 5891 All slides ©Addison Wesley, 2008.
Search Engines and Information Retrieval Chapter 1.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Professor Michael J. Losacco CIS 1110 – Using Computers Database Management Chapter 9.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Search Engine Architecture
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
ITGS Databases.
Algorithmic Detection of Semantic Similarity WWW 2005.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Query Suggestion. n A variety of automatic or semi-automatic query suggestion techniques have been developed  Goal is to improve effectiveness by matching.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
1 Information Retrieval LECTURE 1 : Introduction.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
A search engine is a web site that collects and organizes content from all over the internet Search engines look through their own databases of.
WIRED Future Quick review of Everything What I do when searching, seeking and retrieving Questions? Projects and Courses in the Fall Course Evaluation.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
WIRED Week 6 Syllabus Review Readings Overview Search Engine Optimization Assignment Overview & Scheduling Projects and/or Papers Discussion.
© Prentice Hall1 DATA MINING Web Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides.
General Architecture of Retrieval Systems 1Adrienn Skrop.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
Information Retrieval in Practice
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Information Retrieval (in Practice)
IST 516 Fall 2011 Dongwon Lee, Ph.D.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Introduction to Search Engines
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Presentation transcript:

Architecture of a Search Engine Chapter 2 Architecture of a Search Engine

Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them Describes a system at a particular level of abstraction Architecture of a search engine determined by two requirements Effectiveness (quality of results) and efficiency (response time and throughput)

Indexing Process Takes index terms and creates data structures (indexes) to support fast searching Identifies and stores documents for indexing Transforms documents into index terms or features

Query Process Supports creation/refinement Uses query and indexes of query, display of results Uses query and indexes to generate ranked list of documents Monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition Crawler Identifies and acquires documents for search engines Many types – Web, enterprise, desktop Web crawlers follow links to find documents Must efficiently find huge numbers of web pages ( ) and keep them up-to-date ( ) Single site crawlers for site search Topical or focused crawlers for image search Document crawlers for enterprise and desktop search Follow links and scan directories coverage freshness

Text Acquisition Feeds Conversion Real-time streams of documents e.g., Web feeds for news, blogs, video, radio, TV RSS is common standard RSS “reader” can provide new XML docs to search engine Conversion Convert variety of documents into a consistent text plus metadata format e.g., HTML, Word, PDF, etc. → XML Convert text encoding for different languages Using a Unicode standard like UTF-8

Text Transformation Parser Processing the sequence of text tokens (i.e., words) in the document to recognize structural elements e.g., titles, links, headings, etc. Tokenizer recognizes “words” in the text (and queries) for comparison, a non-trivial process. Must consider issues like , , , , Markup languages such as HTML and XML often used to specify structure Tags used to specify document elements, e.g., <h2>Overview</h2> Document parser uses syntax of markup language (or other formatting) to identify structure capitalization hyphens apostrophes non-alpha characters separators

Text Transformation Stopping Stemming Remove common (function) words e.g., “and”, “or”, “the”, “in” Some impact on efficiency and effectiveness (reduce the size of indexes) A problem for some queries, e.g., “to be or not to be” Stemming Group words derived from a common stem e.g., “computer”, “computers”, “computing”, “compute” Often effective (in terms of matching); not for all queries Benefits vary for different languages (Arabic vs. Chinese)

Text Transformation Link Analysis Use links and anchor text in web pages for indexing Link analysis identifies popularity and community information for ranking purpose e.g., PageRank Anchor text can significantly enhance the representation of pages pointed to by links Significant impact on web search—hub and authority Less importance in other applications

Text Transformation Information Extraction Classifier Identify classes of index terms that are important for some applications e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc., using part- of-speech tagging Classifier Identifies class-related metadata for documents i.e., assigns labels to documents e.g., topics, reading levels, sentiment, genre Use depends on application

Index Creation Document Statistics (collected during the indexing process) Gathers word counts and positions of words and other features (e.g., length of documents as number of tokens) Used in ranking algorithm (IR model dependent) Stored in lookup tables Weighting (during the query process) Computes weights for index terms e.g., TF-IDF weight Combination of term frequency in document and inverse document frequency in the collection

Index Creation Example. Inversion of word list : : Word Doc# Freq ab 2 1 being 2 1 charact 2 1 human 2 1 index 1 1 literat 1 1 novel 1 1 pap 1 1 report 1 1 2 1 result 1 1 technique 1 2 : : Word Doc# Freq Word Doc# pap 1 report 1 novel 1 technique 1 literat 1 result 1 index 1 : : report 2 charact 2 human 2 being 2 ab 2 : : ab 2 being 2 charact 2 human 2 index 1 literat 1 novel 1 pap 1 report 1 report 2 result 1 technique 1 : : Word Doc# Remove Duplicates Sort

Term-Document Incidence Matrix Matrix element (t, d) = 1, if term t in document d; 0, otherwise Example. Documents Terms Term-Term Correlation Matrix: M  MT, where M is a term- document matrix, MT is the transpose of M, and ‘’ is the matrix composition operator

Index Creation Inversion Core of indexing process Converts document-term information to term-document for indexing Difficult for very large numbers of documents to achieve high efficiency (for initial setup and subsequent updates) Multiple-level indexing is desirable for very large number of indexes, e.g., B+-tree indexing Format of inverted file is designed for fast query processing Must also handle updates, besides creation Compression used for efficiency

B+-Tree (Multi-level) Indices Frequently used index structure in DB Allow efficient updates of new/existing search-key values A balanced tree structure: all leaf nodes are at the same level In each leaf node, Pi points to either (i) a data record with search-key value Ki or (ii) a bucket of pointers, each points to a data record with search-key value Ki X X < K1 Ki-1 X  Ki Kn-1 X P1 K1 … Ki-1 Pi Ki … Kn-1 Pn

Index Creation Index Distribution Distributes indexes across multiple computers and/or multiple sites (a decentralization approach) Essential for fast query processing with large number of documents Many variations for parallel processing Document distribution (distribute indexes for a subset of documents for indexing + query processing) Term distribution (distribute indexes for a subset of terms for query processing) Replication (copies of indexes + documents are stored at multiple sites) P2P and distributed IR involve search across multiple sites

User Interaction Query input Provides interface and parser for query language Most web queries are very simple, such as keyword queries, other applications may use forms Query language used to describe more complex queries and results of query transformation Boolean queries “Quotes” for phrase queries, indicating relationships among words For keyword searches, longer queries yield less results Similar to SQL language used in DB applications IR query languages allow structure specification, but focus on content Goal: yields good (better) results for a range of (specific) queries

User Interaction Query transformation Improves initial query, both before and after initial search Includes the text transformation techniques used for documents, i.e., tokenizing, stopping, stemming Spell checking/query suggestion, which provide alternatives (correcting spelling errors/specification) to the original query, is based on query logs Modify the original query with additional terms Query expansion: provides additional terms to a query based on term occurrences in documents or query logs Relevance feedback: terms in previous retrieved relevant documents

User Interaction Results output Constructs the display of ranked documents for a query Generates snippets to show how queries match documents Highlights important words and passages May provide clustering and other visualization tools

Ranking Scoring Calculates scores for documents using a ranking algorithm Is a core component of search engine Basic form of score is  qi di where V is the vocabulary of the document collection qi and di are query and document term weights, e.g., TF/IDF or term probability for term i Many variations of ranking algorithms and retrieval models Must be calculated very rapidly to achieve performance optimization |V| i=1

Ranking Performance optimization Distribution Designing ranking algorithms for efficient processing Decrease query response time & increase system throughput Term-at-a time (matching each query term with terms in text) Document-at-a-time (using indexes to match all the terms at once) Using top-ranked documents: safe vs. unsafe optimizations Distribution Processing queries in a distributed environment Query broker distributes queries and assembles results Caching is a form of distributed searching

Evaluation Logging Ranking analysis Performance analysis Logging user queries and interaction is crucial for improving search effectiveness and efficiency Query logs and click through data (and dwell time) used for spell checking, query suggestion/expansion, query caching, click-through data, dwell time, etc. based on top-ranking documents Ranking analysis Measuring and tuning ranking effectiveness using log data Performance analysis Measuring and tuning system efficiency based on response time and throughput Real time or simulations (replacing actual components by mathematical models)