Mining the Web, Ch. 3: Web Search and Information Retrieval. Artificial Intelligence Lab, 박대원.

2 Contents
- What is IR?
- Queries & Inverted Index
- Relevance Ranking
- Similarity Search

3 What is IR?
- IR (Information Retrieval)
  - Prepare a keyword index for the given corpus
  - Respond to keyword queries with a ranked list of documents
- Web search engine
  - Based on an IR system
  - Given corpus: the Web

4 Queries & Inverted Index
- Query: a sequence of words
  - List the words that can represent the document being sought
  - Boolean query (the typical query)
    - An expression built from terms and Boolean operators
    - Examples: "Java" or "API"; "Java" and "island"; "Java" not "coffee"
  - Proximity query
    - A query that uses term position information
    - Examples: the phrase "java beans"; "java" and "island" in the same sentence

5 Queries & Inverted Index
- "Document-term" relation
  - Document-centric
  - A document is composed of words
    - "The words in a document represent its content."
  - How, then, do we find the documents we want?

6 Queries & Inverted Index
- Inverted index
  - "Term-document" relation
  - Term-centric
  - Posting file
    - Includes term positions
  - Indexing
    - Requires a step that extracts terms from documents
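
The inverted index and the query types from the earlier slide can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the three sample documents, the whitespace tokenizer, and the function names `build_index`, `boolean_and`, `boolean_not`, and `phrase` are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}, i.e. a positional posting list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, a, b):
    """Documents containing both terms."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def boolean_not(index, a, b):
    """Documents containing a but not b."""
    return set(index.get(a, {})) - set(index.get(b, {}))

def phrase(index, a, b):
    """Proximity query: documents where b occurs immediately after a."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        positions_b = set(index[b][doc_id])
        if any(p + 1 in positions_b for p in index[a][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical toy corpus.
docs = {
    1: "java beans are reusable components",
    2: "java is an island in indonesia",
    3: "coffee beans from java",
}
index = build_index(docs)
print(boolean_and(index, "java", "island"))   # {2}
print(boolean_not(index, "java", "coffee"))   # {1, 2}
print(phrase(index, "java", "beans"))         # {1}
```

Storing positions in the posting list is what makes the phrase query possible; a plain Boolean index would only need the document IDs.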

7 Queries & Inverted Index
- Indexing
  - Stopwords & stemming
    - Stopwords
      - Examples: a, an, the, of, with (English); particles and word endings (Korean)
      - Removing stopwords
        - Reduces index space
        - May reduce recall (in phrase search)
        - Example: "to be or not to be"
    - Stemming
      - Matches a query term with a morphological variant
      - Examples: gains, gaining -> gain; went, goes -> go
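
The recall problem with stopword removal is easy to see in code. The sketch below assumes a small hand-picked stopword list (the slide's examples plus a few more) and a whitespace tokenizer; neither comes from the deck.

```python
# Assumed stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "with", "to", "be", "or", "not"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

print(remove_stopwords("the history of java"))  # ['history', 'java']
print(remove_stopwords("to be or not to be"))   # []
```

With this list the phrase "to be or not to be" is reduced to nothing, so an index built without stopwords cannot answer it.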

8 Queries & Inverted Index
- Indexing
  - Batch indexing and update
    - Changing index
    - Indexing/updating uses 2 indices
  - Index compression
    - Use data compression methods
    - Examples: gamma code, delta code, Golomb code
    - [Table: gap x encoded with the unary, gamma, delta, and Golomb (b = 3, b = ...) coding methods]
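
The codes named on this slide can be sketched as bit-string encoders for a gap x >= 1. The conventions below are assumptions, since the slide's table did not survive: unary writes x-1 one-bits followed by a zero-bit, gamma and delta follow the usual Elias definitions, and Golomb uses a truncated binary remainder. The function names are illustrative.

```python
def unary(x):
    """x >= 1 written as x-1 one-bits followed by a zero-bit."""
    return "1" * (x - 1) + "0"

def binary(value, width):
    """value as a fixed-width bit string; empty when width is 0."""
    return format(value, "b").zfill(width) if width > 0 else ""

def gamma(x):
    """Unary code for 1 + floor(log2 x), then the low floor(log2 x) bits of x."""
    n = x.bit_length() - 1
    return unary(n + 1) + binary(x - (1 << n), n)

def delta(x):
    """Like gamma, but the length prefix is itself gamma-coded."""
    n = x.bit_length() - 1
    return gamma(n + 1) + binary(x - (1 << n), n)

def golomb(x, b):
    """Quotient of (x-1)/b in unary, remainder in truncated binary."""
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()        # ceil(log2 b)
    cutoff = (1 << k) - b           # remainders below this use k-1 bits
    tail = binary(r, k - 1) if r < cutoff else binary(r + cutoff, k)
    return unary(q + 1) + tail

for x in (1, 2, 9):
    print(x, unary(x), gamma(x), delta(x), golomb(x, 3))
```

Small gaps get short codes in every scheme, which is why posting lists are stored as gaps between successive document IDs rather than as the IDs themselves.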

9 Relevance Ranking
- Evaluation of IR
  - Recall
    - The fraction of relevant documents that are retrieved
  - Precision
    - The fraction of retrieved documents that are relevant
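
Both measures follow directly from the set of retrieved documents and the set of relevant documents. A minimal sketch, with made-up document IDs and a hypothetical `precision_recall` helper:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```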

10 Relevance Ranking
- Vector-space model
  - [Figure: documents D1 and D2 are mapped to vectors V1 and V2 (a bit vector capturing the essence/meaning of each document) and the query to Q1; ranking compares Sim(V1, Q1) with Sim(V2, Q1) to find max Sim(Vi, Q1).]

11 Relevance Ranking
- Vector Space Model
  - Documents are represented as vectors
  - Term weight: tf*idf
    - tf: term frequency
    - idf: inverse document frequency
  - Cosine measure: $\mathrm{Sim}(D, Q) = \frac{D \cdot Q}{\|D\| \, \|Q\|}$
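
A small sketch of tf*idf weighting with cosine ranking follows. The exact weighting formula is not given on the slide, so the variant used here (raw term count times log((1 + N) / df)) is an assumption, as are the toy documents and the helper names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (list of sparse tf*idf vectors, idf table) for a list of texts."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / df[t]) for t in df}   # assumed idf variant
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["java island indonesia", "java coffee beans", "python snake"]
vectors, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in "java coffee".split()}
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(vectors[i], query), reverse=True)
print(ranked)  # [1, 0, 2]
```

The second document ranks first because it shares both query terms, and "coffee" carries more weight than "java" because it appears in fewer documents.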

12 Relevance Ranking
- Relevance Feedback
  - The average Web query is two words long
    - Insufficient words
  - Modify queries by adding or negating additional keywords
  - Relevance feedback
    - Query refinement process
    - Rocchio's method
      - D+: relevant documents; D-: irrelevant documents
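
The slide's formula for Rocchio's method is not reproduced in the transcript. One common form, assumed here, moves the query toward the centroid of D+ and away from the centroid of D-: q' = alpha * q + beta * mean(D+) - gamma * mean(D-). The parameter defaults and the choice to drop non-positive weights are illustrative assumptions.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Refine a sparse query vector using judged documents (all dicts)."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant) if irrelevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:                      # assumed: discard non-positive weights
            new_query[t] = w
    return new_query

refined = rocchio(
    query={"java": 1.0},
    relevant=[{"java": 1.0, "island": 1.0}],     # D+
    irrelevant=[{"java": 1.0, "coffee": 1.0}],   # D-
)
print(sorted(refined))  # ['island', 'java']
```

The refined query gains "island" from the relevant document, while "coffee" ends up with a negative weight and is dropped.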

13 Relevance Ranking
- Probabilistic Relevance Feedback Models
  - Probabilistic models estimate the relevance of documents
  - Odds ratio for relevance
    - Requires too much effort
  - Bayesian inference network (chapter 5)
    - Represented by a directed acyclic graph with document, representation, and concept layers of nodes
    - Requires manual mapping of terms to concepts

14 Relevance Ranking
- Advanced Issues (issues that hypertext search engines need to handle)
  - Spamming
    - Terms that go unnoticed by humans but are picked up by search engines
    - Eliminate spam words by font color, position, repetition, ...
    - Hyperlink-based ranking techniques
  - Titles, headings, metatags, and anchor text
    - No distinction is made for titles, headings, metatags, or anchors
    - Use the structural information of Web pages
    - Use anchor text

15 Relevance Ranking
- Advanced Issues (issues that hypertext search engines need to handle)
  - Ranking for complex queries, including phrases
    - Phrase dictionary
    - Use the positions of terms within the document (sentence)
  - Approximate string matching
    - Retrieve words that match partially
    - Use n-grams
  - Meta-search systems

16 Similarity Search
- Web data problem
  - Page replication, site mirroring, archived data, etc.
- Handling "find-similar" queries
  - "Find-similar" (similar-document search)
    - Given a "query" document d_q, find some small number of documents d from the corpus D having the largest value of d_q · d
    - Similarity measure: Jaccard coefficient
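
A "find-similar" query under the Jaccard coefficient can be sketched as below. Documents are treated as sets of whitespace-separated terms, which is an assumption, as are the corpus, the helper names, and the convention that two empty sets have similarity 0.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0  # assumed: empty vs empty -> 0

def find_similar(query_doc, corpus, k=2):
    """Return the k document IDs most similar to query_doc."""
    q = query_doc.lower().split()
    scored = [(jaccard(q, text.lower().split()), doc_id)
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical corpus with one near-copy of the query document.
corpus = {
    "a": "mining the web search and retrieval",
    "b": "mining the web search and information retrieval",
    "c": "cooking with coffee beans",
}
print(find_similar("mining the web search and retrieval notes", corpus))  # ['a', 'b']
```

This brute-force scan compares the query against every document, which is fine for a toy corpus but not for the Web-scale setting the slide has in mind.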

17 Similarity Search
- Eliminating Near Duplicates via Shingling
  - Comparing checksums of entire pages
    - Maintain a checksum with every page in the corpus
    - Detects replicated documents (depends on exact equality of checksums)
  - Measuring the dissimilarity between pages: edit distance
    - Time-consuming and impractical (all pairs of documents)
  - q-gram or shingle
    - A contiguous subsequence of tokens taken from a document
    - S(d, w): the set of distinct shingles of width w in document d
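
The shingle set S(d, w) can be computed directly from its definition, and two pages can then be compared by the Jaccard coefficient of their shingle sets. The whitespace tokenizer, the sample strings, and the `resemblance` name are assumptions for this sketch.

```python
def shingles(text, w):
    """S(d, w): the set of distinct width-w token sequences in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(d1, d2, w):
    """Jaccard coefficient of the two shingle sets."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

print(len(shingles("a rose is a rose is a rose", 4)))  # 3
print(resemblance("the quick brown fox jumps",
                  "the quick brown fox sleeps", 2))    # 0.6
```

Unlike a whole-page checksum, a single changed word only disturbs the few shingles that contain it, so near-duplicates still score high.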

18 Similarity Search
- Detecting Locally Similar Subgraphs of the Web (chapter 7)
  - Collapsing locally similar Web subgraphs can improve hyperlink-assisted ranking
  - Approaches to detecting mirrored sites
    - Approach 1
      - Suspected duplicates are reduced to a sequence of outlinks, with all HREF strings converted to a canonical form
      - Cleaned URLs are assigned unique token IDs, then listed and sorted to find duplicates or near-duplicates

19 Similarity Search
- Detecting Locally Similar Subgraphs of the Web (chapter 7)
  - Approaches to detecting mirrored sites
    - Approach 2
      - Use regularity within URL strings to identify host pairs
        - Convert host and path to all lowercase characters
        - Let any punctuation or digit sequence be a token separator
        - Tokenize the URL into a sequence of tokens; for example, www5.infoseek.com -> www, infoseek, com
        - Eliminate stop terms such as htm, html, txt, cgi, main, index, home
        - Form positional bigrams from the token sequence; for example, '/cell-block16/inmates/dilbert/personal/foo.htm' -> (cellblock,inmates,0), (inmates,dilbert,1), (dilbert,personal,2), (personal,foo,3)
      - Use the "find-similar" algorithm
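
The tokenization steps of Approach 2 can be sketched with a regular expression. This follows the rules as listed (lowercase, split on any run of punctuation or digits, drop stop terms, emit positional bigrams); the function names are illustrative. Because it splits on every hyphen, a segment such as "cell-block16" yields two tokens here, unlike the slide's example, which shows it as the single token "cellblock".

```python
import re

# Stop terms taken from the slide's list.
STOP_TERMS = {"htm", "html", "txt", "cgi", "main", "index", "home"}

def url_tokens(s):
    """Lowercase, split on non-letter runs, and drop stop terms."""
    tokens = [t for t in re.split(r"[^a-z]+", s.lower()) if t]
    return [t for t in tokens if t not in STOP_TERMS]

def positional_bigrams(s):
    """Adjacent token pairs tagged with the position of the first token."""
    tokens = url_tokens(s)
    return [(a, b, i) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]

print(url_tokens("www5.infoseek.com"))
# ['www', 'infoseek', 'com']
print(positional_bigrams("/inmates/dilbert/personal/foo.htm"))
# [('inmates', 'dilbert', 0), ('dilbert', 'personal', 1), ('personal', 'foo', 2)]
```

The resulting bigram sets can be fed to a set-similarity measure to shortlist host pairs that are likely mirrors.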