1
(C) 2003, The University of Michigan
Information Retrieval
Handout #7
March 24, 2003
2
Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M&F 11-12
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Mondays, 1-4 PM in 409 West Hall
3
Schedule
Readings for 03/31:
–Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
–Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
–Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
4
Schedule
March 24
–The link-content hypothesis
–XML retrieval
March 31
–Information extraction
–Language reuse
April 7
–Language modeling for IR
–The Lemur system
5
Schedule
HW3 assigned 03/24
HW3 due 04/07
Final projects due 04/11
Final project presentations 04/14
Final exam 04/21
–2-3 essay questions, 2-3 problems
6
The link-content hypothesis
7
Web structure
Kleinberg and Lawrence, "The Structure of the Web", Science 294, 1849-1850
8
Web structure
16-20 links on average
The fraction of pages with n in-links is approximately $n^{-\gamma}$ for $\gamma \approx 2.1$
Kleinberg/Lawrence: 100,000 coherent communities (e.g., people concerned with oil spills off the coast of Japan)
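As a rough illustration of the in-link power law on this slide, the sketch below normalizes n^(-2.1) over a finite range of in-link counts. The exponent 2.1 comes from the slide; the function name `inlink_fraction` and the cutoff `max_n` are assumptions made only for this example.

```python
def inlink_fraction(gamma=2.1, max_n=1000):
    """Return {n: fraction of pages with n in-links} for a truncated power law.

    gamma is the exponent quoted on the slide; max_n is an arbitrary cutoff
    chosen here so that the fractions can be normalized to sum to 1.
    """
    weights = {n: n ** -gamma for n in range(1, max_n + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


fractions = inlink_fraction()
# Pages with 10 in-links are rarer than pages with 1 in-link by a factor
# of 10 ** -2.1, regardless of the normalization.
ratio = fractions[10] / fractions[1]
```

Under these assumptions, each tenfold increase in the number of in-links cuts the fraction of pages by a factor of roughly 126.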
9
Topical locality [Davison 00]
Most web pages are linked to others with related content; this helps users navigate the Web.
Presence of topical locality is important for building focused crawlers.
Traditionally, search engines indexed only titles and/or the first few lines of each document. Now, they index all links.
“More evil than Satan himself”
10
Experimental design
Local crawl of 100,000 pages
Starts from HotBot and AltaVista
Biased towards English-language pages
Retrieve one outgoing link from each page
11
TFIDF cosine similarity
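The formula on this slide did not survive extraction, so the following is only a minimal sketch of one common TF-IDF cosine computation. The weighting used here (raw term frequency times log(N/df)) and the names `tfidf_vectors` and `cosine` are assumptions; the exact variant used in [Davison 00] may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for this example.
docs = [
    "focused crawling of web pages".split(),
    "web pages link to related pages".split(),
    "recipe for tomato soup".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents share "web" and "pages"
sim_unrelated = cosine(vecs[0], vecs[2])  # documents share no terms
```

Documents with no terms in common score exactly zero, which is consistent with the near-zero similarity the deck reports for random page pairs.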
12
Other metrics
Query-document overlap
Query term probability
13
Experimental results
100,000 URLs but only 89,891 retrievable
An additional 111,107 URLs: two children per initial page
www.geocities.com (561), www.webring.com (419), www.amazon.com (303), etc.
18% top-level pages
50% .com, 27% .edu
14
Textual similarity
TFIDF similarity
–0.31 same domain
–0.23 linked pages
–0.19 sibling
–0.02 random
15
Structure and content [Menczer 01]
Cluster hypothesis (van Rijsbergen 79)
Link-cluster conjecture (Menczer): preservation of semantics across links
16
Experimental design
Open Directory Project (dmoz.org)
896,233 URLs from 97,614 topics
150,000 URLs from 47,174 topics
10,000 from each of the 15 top-level branches
17
Measures of similarity
Cosine
Link similarity
Semantic similarity
[Figure: topics c1 and c2 and their lowest common ancestor (lca) in the topic tree]
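The formulas behind these measures are not preserved in the transcript. The sketch below shows one plausible reading: link similarity as the Jaccard overlap of two pages' link neighborhoods, and semantic similarity as a Lin-style information-theoretic measure over a topic tree, using the lowest common ancestor (lca) of topics c1 and c2. Both choices, the toy tree, and the topic probabilities are assumptions for illustration, not necessarily the definitions used in [Menczer 01].

```python
import math


def link_similarity(neighbors_p, neighbors_q):
    """Jaccard overlap between two sets of neighboring URLs."""
    union = neighbors_p | neighbors_q
    if not union:
        return 0.0
    return len(neighbors_p & neighbors_q) / len(union)


def lowest_common_ancestor(c1, c2, parent):
    """Walk up a topic tree given as {child: parent}; the root maps to None."""
    ancestors = set()
    node = c1
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    node = c2
    while node not in ancestors:
        node = parent[node]
    return node


def semantic_similarity(c1, c2, parent, prob):
    """Lin-style similarity: 2 log P(lca) / (log P(c1) + log P(c2))."""
    lca = lowest_common_ancestor(c1, c2, parent)
    denom = math.log(prob[c1]) + math.log(prob[c2])
    if denom == 0.0:
        return 1.0  # both topics are the root, so they coincide
    return 2 * math.log(prob[lca]) / denom


# Hypothetical topic tree; prob[c] is the fraction of pages filed under c.
parent = {"Top": None, "Science": "Top", "Arts": "Top",
          "Physics": "Science", "Biology": "Science"}
prob = {"Top": 1.0, "Science": 0.5, "Arts": 0.5,
        "Physics": 0.2, "Biology": 0.3}

sim_siblings = semantic_similarity("Physics", "Biology", parent, prob)
sim_distant = semantic_similarity("Physics", "Arts", parent, prob)
sim_links = link_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

With this measure, two topics whose only common ancestor is the root score zero, and a topic compared with itself scores one.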
18
Correlations between similarities
Over 3.84 x 10^9 pairs
Highest for News and Home (> 0.2)
Lowest for Arts and Games (< 0.05)
19
Fit
$\alpha_1 = 1.8$, $\alpha_2 = 0.6$,
20
Document closures for Q&A
[Figure labels: capital, P, LP, Madrid, spain, capital]
21
Document closures for IR
[Figure labels: Physics, P, LP, Physics Department, University of Michigan]
22
The perltree experiments
23.6% of the Excite log (2.5 M queries)
–60% have both words in WordNet
–27% have one word in WordNet
–13% have no words in WordNet
200 queries from the log
200 random queries
23
Two-word queries
jimi SAT seats david caesar poker cruise yellow science Tishara trim yankee witnesses naked swaybar cheats rides Precious drugs university Clock engines metal choreography anthony swinging psychoanalysis webdesign pic lens toys online speech therapy Malcolm McDowell cellular accessories migrant farmworkers witch tv davis instruments Adult Games chichen itza freighter Cruises used motorcycles feng shui revolucion mexicana zeebrugee belgium electronic greetings
24
Query analysis
Words:
–Familiarity
–Ambiguity
–IDF
Queries:
–GoogleSize
–SemDist
–DistribSim
25
Query analysis
             Fam1  Fam2  Amb1  Amb2  IDF1  IDF2  Gsize    SemD  DistS
Excite (E)   1.42  1.89  1.70  2.36  4.00  4.74  670,000  0.39  0.06
Random (R)   1.54  1.61  2.06  2.29  4.40  4.55  329,000  0.29  0.02
26
Link-based language models
WT2g corpus
247,491 pages
3,118,248 links
948,036 unique words
28
Procedure
Given a query q1 q2:
–Get top 50 hits from AltaVista (A)
–Extract links that contain q1 or q2
–Get pages that are linked (B)
–Extract links from A ∪ B that point to A ∪ B
–Index A ∪ B using glimpse
–Compute link fertility
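The set-building steps of this procedure can be sketched over an in-memory link table. This is only an illustration under stated assumptions: `expand_result_set` and its tuple layout are hypothetical, "links that contain q1 or q2" is read as links on pages in A whose anchor text contains a query word, and the glimpse indexing and link fertility steps are omitted because the slides do not define them.

```python
def expand_result_set(query_terms, top_hits, links):
    """Return (A | B, links inside A | B) for an in-memory link table.

    links is a list of (source_url, anchor_text, target_url) tuples.
    """
    terms = {t.lower() for t in query_terms}
    a = set(top_hits)
    # B: targets of links on pages in A whose anchor text mentions a query term.
    b = {
        target
        for source, anchor, target in links
        if source in a and terms & set(anchor.lower().split())
    }
    pages = a | b
    # Keep only links whose two endpoints both lie in A | B.
    internal = [(s, t) for s, _, t in links if s in pages and t in pages]
    return pages, internal


# Toy data, invented for this example.
top_hits = ["a.example/1", "a.example/2"]
links = [
    ("a.example/1", "Feng Shui tips", "b.example/tips"),
    ("a.example/1", "contact us", "c.example/contact"),
    ("b.example/tips", "home", "a.example/2"),
    ("c.example/contact", "feng gallery", "d.example/g"),
]
pages, internal = expand_result_set(["feng", "shui"], top_hits, links)
```

In the toy data, only `b.example/tips` joins the result set, because it is the only page reached from A through a link whose anchor text matches a query word.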
29
Results
New links pointing to pages that were not in the AltaVista top 50
–E = +11.7%, R = +8.9%
Improvements higher for
–rarer words
–lower distributional similarity
–lower semantic distance
30
Topic distillation [Chakrabarti et al. 01]
Topic drift
Returning snippets rather than full documents
Clique attacks (www.411fun.com, www.411fashion.com, www.411loans.com)