Mining the Web, Ch. 3: Web Search and Information Retrieval
Artificial Intelligence Lab, 박대원
Contents
– What is IR?
– Queries & Inverted Index
– Relevance Ranking
– Similarity Search
What is IR?
IR (Information Retrieval)
– Prepare a keyword index for a given corpus
– Respond to keyword queries with a ranked list of documents
Web Search Engine
– Built on an IR system
– Given corpus: the Web
Queries & Inverted Index
Query: a sequence of words
– List words that represent the documents being sought
– Boolean query (the typical query)
  An expression of terms and Boolean operators
  Examples: "Java" or "API"; "Java" and "island"; "Java" not "coffee"
– Proximity query
  A query that uses term-position information
  Examples: the phrase "java beans"; "java" and "island" in the same sentence
Queries & Inverted Index
The 'document-term' relation
– Document-centric view
– A document is made up of terms: 'The terms in a document represent its content.'
– How, then, do we find the documents we want?
Queries & Inverted Index
Inverted Index
– The 'term-document' relation: a term-centric view
– Posting file: for each term, the documents (and term positions) in which it occurs
– Indexing: requires extracting the terms from each document
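A minimal sketch (not from the slides) of the positional inverted index just described: each term maps to a posting list of (document ID, position) pairs, which is what proximity queries rely on.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a positional inverted index: term -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

# Toy corpus reusing the slides' "java" examples
docs = {1: "java island coffee", 2: "java beans api"}
index = build_inverted_index(docs)
print(index["java"])  # [(1, 0), (2, 0)]
```

A real indexer would also apply stopword removal and stemming (next slide) before inserting terms.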
Queries & Inverted Index
Indexing: stopwords & stemming
– Stopwords
  Examples: a, an, the, of, with (English); postpositions and verb endings (Korean)
  Removing stopwords reduces index space, but may reduce recall in phrase search, e.g., "to be or not to be"
– Stemming
  Matches a query term with its morphological variants
  Examples: gains, gaining -> gain; went, goes -> go
Queries & Inverted Index
Indexing (continued)
– Batch indexing and update
  The index changes over time; indexing/updating uses two indices
– Index compression
  Gaps between document IDs in a posting list are encoded with data-compression methods, e.g., gamma, delta, and Golomb codes
[Table: example encodings of gaps x under the unary and Golomb (b = 3, ...) codes; values not recoverable from the source]
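As an illustration of gap compression (a sketch, not the slides' table), the Elias gamma code encodes a positive gap x as floor(log2 x) zero bits followed by the binary representation of x, so small gaps get short codes:

```python
def elias_gamma(x):
    """Elias gamma code for a positive integer gap x:
    floor(log2 x) zero bits, then the binary form of x (which starts with 1)."""
    assert x >= 1
    n = x.bit_length() - 1       # floor(log2 x)
    return "0" * n + bin(x)[2:]  # length prefix in zeros, then binary of x

print(elias_gamma(9))  # 0001001
```

Golomb codes work similarly but tune a parameter b to the expected gap distribution, which is why the slides' table lists codes for several values of b.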
Relevance Ranking
Evaluation of IR
– Recall: the fraction of relevant documents that are retrieved
– Precision: the fraction of retrieved documents that are relevant
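The two definitions above can be sketched directly (a minimal illustration, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)  # 0.5 0.666...
```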
Relevance Ranking
Vector-space model
[Figure: documents D1 and D2 are mapped to vectors V1 and V2 capturing the essence/meaning of each document; a query Q1 is answered by finding the vector Vi that maximizes Sim(Vi, Q1)]
Relevance Ranking
Vector Space Model
– Documents are represented as vectors
– Term weight: tf * idf
  tf: term frequency; idf: inverse document frequency
– Cosine measure: Sim(D, Q) = (D . Q) / (|D| |Q|)
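A compact sketch of tf*idf weighting and the cosine measure, assuming the common idf form log(N / df) (the slides do not fix a specific idf formula):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * idf, with idf = log(N / df)."""
    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]
    df = Counter(t for tf in tfs for t in tf)  # document frequency per term
    return [{t: f * math.log(N / df[t]) for t, f in tf.items()} for tf in tfs]

def cosine(u, v):
    """Sim(D, Q) = (D . Q) / (|D| |Q|) over sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Note that a term appearing in every document gets idf = log(1) = 0, i.e., it carries no discriminating weight, which is the point of the idf factor.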
Relevance Ranking
Relevance Feedback
– The average web query is only two words long: too few words to pin down the information need
– Queries are modified by adding or negating additional keywords
– Relevance feedback: a query-refinement process
  Rocchio's method: q' = alpha*q + beta*mean(D+) - gamma*mean(D-), where D+ is the set of relevant documents and D- the set of irrelevant documents
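Rocchio's update can be sketched over sparse term-weight vectors; the mixing weights alpha, beta, gamma below are illustrative defaults, not values from the slides:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(D+) - gamma*mean(D-).
    Vectors are dicts mapping term -> weight."""
    terms = (set(query)
             | {t for d in relevant for t in d}
             | {t for d in irrelevant for t in d})
    q_new = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if irrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant)
        q_new[t] = max(w, 0.0)  # negative weights are commonly clipped to zero
    return q_new
```

The effect is to pull the query vector toward the centroid of the marked-relevant documents and push it away from the irrelevant ones.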
Relevance Ranking
Probabilistic Relevance Feedback Models
– Probabilistic models estimate the relevance of documents via the odds ratio for relevance, but require too much feedback effort
– Bayesian inference networks (chapter 5)
  Represented as directed acyclic graphs with document, representation, and concept layers of nodes
  Require a manual mapping of terms to concepts
Relevance Ranking
Advanced issues that hypertext search engines must handle
– Spamming
  Terms that go unnoticed by human readers but are still indexed by search engines
  Countermeasures: eliminate spam terms by font color, position, repetition, etc.; use hyperlink-based ranking techniques
– Titles, headings, metatags, and anchor text
  Classical IR makes no distinction among titles, headings, metatags, and anchors
  Exploit the structured information in web pages, including anchor text
Relevance Ranking
Advanced issues (continued)
– Ranking for complex queries, including phrases
  Use a phrase dictionary, or the positions of terms within documents (sentences)
– Approximate string matching
  Retrieve partially matching words, e.g., by using n-grams
– Meta-search systems
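One common way to realize n-gram approximate matching (a sketch, not the slides' method) is to compare sets of padded character n-grams, so that morphologically close words share many grams:

```python
def ngrams(word, n=3, pad="$"):
    """Character n-grams with boundary padding, e.g. 'java' -> $$j, $ja, jav, ..."""
    w = pad * (n - 1) + word + pad * (n - 1)
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def ngram_sim(a, b, n=3):
    """Dice-style overlap of n-gram sets as a crude approximate-match score."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

For example, "gains" and "gain" share most of their trigrams, so they score far higher than an unrelated pair, which lets a partially matching query term still retrieve the document.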
Similarity Search
The web-data problem
– Page replication, site mirroring, archived data, etc.
Handling "find-similar" queries
– "Find-similar" (similar-document search): given a "query" document d_q, find a small number of documents d from the corpus D with the largest similarity to d_q
– Similarity measure: the Jaccard coefficient
Similarity Search
Eliminating near-duplicates via shingling
– Comparing checksums of entire pages
  Maintain a checksum for every page in the corpus; detects replicated documents, but depends on exact equality of checksums
– Measuring the dissimilarity between pages with edit distance
  Too time-consuming to be practical over all pairs of documents
– q-grams (shingles)
  A shingle is a contiguous subsequence of tokens taken from a document
  S(d, w): the set of distinct shingles of width w in document d
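The shingle sets S(d, w) and their Jaccard coefficient can be sketched directly (a minimal illustration; production systems additionally sample shingles, e.g., via min-hashing, to avoid comparing full sets):

```python
def shingles(text, w=3):
    """S(d, w): the set of distinct w-token shingles of document d."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}

def resemblance(d1, d2, w=3):
    """Jaccard coefficient of the two shingle sets."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    return len(s1 & s2) / len(s1 | s2)
```

Unlike a whole-page checksum, this score degrades gracefully: a page with one edited sentence still shares almost all of its shingles with the original.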
Similarity Search
Detecting locally similar subgraphs of the Web (chapter 7)
– Collapsing locally similar web subgraphs can improve hyperlink-assisted ranking
– Approaches to detecting mirrored sites
  Approach 1
  – Suspected duplicates are reduced to a sequence of outlinks, with all href strings converted to a canonical form
  – The cleaned URLs are assigned unique token IDs, then listed and sorted to find duplicates or near-duplicates
Similarity Search
Detecting locally similar subgraphs of the Web (continued)
– Approaches to detecting mirrored sites
  Approach 2: use regularities within URL strings to identify host pairs
  » Convert host and path to all lowercase characters
  » Treat any punctuation or digit sequence as a token separator
  » Tokenize the URL into a sequence of tokens, e.g., www5.infoseek.com -> www, infoseek, com
  » Eliminate stop terms such as htm, html, txt, cgi, main, index, home
  » Form positional bigrams from the token sequence, e.g., '/cell-block16/inmates/dilbert/personal/foo.htm' -> (cellblock,inmates,0), (inmates,dilbert,1), (dilbert,personal,2), (personal,foo,3)
  – Then apply the "find-similar" algorithm to the resulting bigram sets
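The tokenization steps above can be sketched as follows. Note one simplification versus the slide's example: the slide fuses 'cell-block' into a single token 'cellblock', whereas this sketch applies the stated rule literally and splits at every punctuation mark, so the bigram positions differ slightly.

```python
import re

# Stop terms listed on the slide
STOP_TERMS = {"htm", "html", "txt", "cgi", "main", "index", "home"}

def url_bigrams(url):
    """Lowercase a URL, split it at punctuation/digit runs, drop stop terms,
    and form positional bigrams (token_i, token_i+1, i)."""
    tokens = [t for t in re.split(r"[^a-z]+", url.lower())
              if t and t not in STOP_TERMS]
    return [(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)]

print(url_bigrams("/cell-block16/inmates/dilbert/personal/foo.htm"))
# [('cell', 'block', 0), ('block', 'inmates', 1), ('inmates', 'dilbert', 2),
#  ('dilbert', 'personal', 3), ('personal', 'foo', 4)]
```

Two mirrored paths then yield largely overlapping bigram sets, which the "find-similar" machinery from the previous slides can score.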