Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Search Engines

Similar presentations


Presentation on theme: "Introduction to Search Engines"— Presentation transcript:

1 Introduction to Search Engines

2 Search Engine Overview
Query (질의) 1 Searchable Index (색인) Search Results 2 3 Search Data (0) (1) Query Indexing (2) Document Ranking (3) Result Display 1. Document Collection - e.g., spider/crawler 2. Document Indexing - term indexing (tokenizing, stop & stem) - term weighting USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  User Intermediary Information What am I looking for? - Identification of info. need What question do I ask? - Query formulation What is the searcher looking for? - Discovery of user’s info. need How should the question be posed? - Query representation Where is the relevant information? - Query-document matching What data to collect? - Collection development What information to index? - Indexing/Representation How to represent it? - Data structure Search Engines

3 Search Engine: Data Document Collection Document Indexing
Select target data sources – e.g., domain, corpus, WWW Harvest data – e.g., data entry, data import, spider/crawler Document Indexing Select indexing sources (색인어) – e.g., metadata, keywords, content Extract indexing terms – e.g., tokenization, stop & stem Assign term weights – e.g., tf-idf, okapi “The frequency of word occurrence in an article furnishes a useful measurement of word significance.” 문헌에 출현한 던어들은 문헌의 내용 분석을 위해 사용될 수 있으며, 단어의 출현빈도가 이 단어의 주제어로서의 중요성을 측정하는 기준이 된다 . Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, Search Engines

4 Search Engine: Indexing Process
Documents (Text) INVERTED INDEX Term Weighting Tokenization Tokens Tokens SEQUENTIAL INDEX Tokens Token Selection Tokens Tokens Tokens Tokens Token Normalization Select Tokens D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) D1 information 1, retrieval 1, seminar 1 D2 information 1, model 1, retrieval 2 D3 information 1, model 1 D1: Information retrieval seminars D2: Retrieval Models and Information Retrieval D3: Information Model D1: information, retrieval, seminar(s) D2: retrieval, model(s), and, information, retrieval D3: information, model Search Engines

5 Search Engine: Search Query Indexing Document Ranking Result Display
Tokenization Stop & Stem Term Weighting Document Ranking Query-Document matching Document Score computation Result Display Content - e.g., title & snippets Layout - e.g., grouped by category Toppings - e.g., related searches Query: What is information retrieval? Q: Information 1, retrieval 1 Index Term D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) Rank docID score 1 D2 3 2 D1 D3 Search Engines

6 2015 8 1 9 2 10 11 3 4 12 5 13 6 14 7 Search Engines

7 Result Categories 2015 15 16 17 Proprietary (Naver-specific) content
Encyclopedia Naver Books Q&A DB (지식iN) Magazine Café Blog Book Map Website Advertisement (파워링크) Image Webpage Naver News Library Video Naver AppStore Naver Scholar Naver Post Naver Shopping News Naver Dictionary 15 16 17 Proprietary (Naver-specific) content Dynamic category order Toppings Search by Category Related Searches Popular Searches (by category) 18 Query: 정보검색 (Information Retrieval) Query: 검색엔진 (Search Engine) 19 20 Search Engines

8 Result Categories 2015 1 Webpage-centric content
Advertisement 1 Webpage-centric content Dynamic category order Toppings Search by Category Related Searches 2 Query: Information Retrieval Query: Search Engine Search Engines

9 Search Engine vs. Database vs. Directories
Corpus Type General Specific General/Specific Data Collection Automatic - crawler/spider Manual - data entry/import - classification Data Quality Not controlled Controlled Data Organization None (bag-of-words) Structured - Relational - Hierarchical Query Input Text box Field-specific - Boolean Search Result Ranked - documents Not ranked - records - categories Search Index Document text Database Tables Category Tree e.g. Google Library Search dmoz.org USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  Search Engines


Download ppt "Introduction to Search Engines"

Similar presentations


Ads by Google