Introduction to Search Engines
Search Engine Overview Query (질의) 1 Searchable Index (색인) Search Results 2 3 Search Data (0) (1) Query Indexing (2) Document Ranking (3) Result Display 1. Document Collection - e.g., spider/crawler 2. Document Indexing - term indexing (tokenizing, stop & stem) - term weighting USER: Has information need 1. Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2. Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3. Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information 1. Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2. Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3. Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO redo #1; YES redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need User Intermediary Information What am I looking for? - Identification of info. need What question do I ask? - Query formulation What is the searcher looking for? - Discovery of user’s info. need How should the question be posed? - Query representation Where is the relevant information? - Query-document matching What data to collect? - Collection development What information to index? - Indexing/Representation How to represent it? - Data structure Search Engines
Search Engine: Data Document Collection Document Indexing Select target data sources – e.g., domain, corpus, WWW Harvest data – e.g., data entry, data import, spider/crawler Document Indexing Select indexing sources (색인어) – e.g., metadata, keywords, content Extract indexing terms – e.g., tokenization, stop & stem Assign term weights – e.g., tf-idf, okapi “The frequency of word occurrence in an article furnishes a useful measurement of word significance.” 문헌에 출현한 던어들은 문헌의 내용 분석을 위해 사용될 수 있으며, 단어의 출현빈도가 이 단어의 주제어로서의 중요성을 측정하는 기준이 된다 . Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165. Search Engines
Search Engine: Indexing Process Documents (Text) INVERTED INDEX Term Weighting Tokenization Tokens Tokens SEQUENTIAL INDEX Tokens Token Selection Tokens Tokens Tokens Tokens Token Normalization Select Tokens D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) D1 information 1, retrieval 1, seminar 1 D2 information 1, model 1, retrieval 2 D3 information 1, model 1 D1: Information retrieval seminars D2: Retrieval Models and Information Retrieval D3: Information Model D1: information, retrieval, seminar(s) D2: retrieval, model(s), and, information, retrieval D3: information, model Search Engines
Search Engine: Search Query Indexing Document Ranking Result Display Tokenization Stop & Stem Term Weighting Document Ranking Query-Document matching Document Score computation Result Display Content - e.g., title & snippets Layout - e.g., grouped by category Toppings - e.g., related searches Query: What is information retrieval? Q: Information 1, retrieval 1 Index Term D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) Rank docID score 1 D2 3 2 D1 D3 Search Engines
2015 8 1 9 2 10 11 3 4 12 5 13 6 14 7 Search Engines
Result Categories 2015 15 16 17 Proprietary (Naver-specific) content Encyclopedia Naver Books Q&A DB (지식iN) Magazine Café Blog Book Map Website Advertisement (파워링크) Image Webpage Naver News Library Video Naver AppStore Naver Scholar Naver Post Naver Shopping News Naver Dictionary 15 16 17 Proprietary (Naver-specific) content Dynamic category order Toppings Search by Category Related Searches Popular Searches (by category) 18 Query: 정보검색 (Information Retrieval) Query: 검색엔진 (Search Engine) 19 20 Search Engines
Result Categories 2015 1 Webpage-centric content Advertisement 1 Webpage-centric content Dynamic category order Toppings Search by Category Related Searches 2 Query: Information Retrieval Query: Search Engine Search Engines
Search Engine vs. Database vs. Directories Corpus Type General Specific General/Specific Data Collection Automatic - crawler/spider Manual - data entry/import - classification Data Quality Not controlled Controlled Data Organization None (bag-of-words) Structured - Relational - Hierarchical Query Input Text box Field-specific - Boolean Search Result Ranked - documents Not ranked - records - categories Search Index Document text Database Tables Category Tree e.g. Google Library Search dmoz.org USER: Has information need 1. Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2. Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3. Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information 1. Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2. Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3. Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO redo #1; YES redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need Search Engines