Slide 1: Internet Resources Discovery (IRD): Web IR (T. Sharon, A. Frank)
Slide 2: Web IR
- What's Different about Web IR?
- Web IR Queries
- How to Compare Web Search Engines?
- The 'HITS' Scoring Method
Slide 3: What's Different about the Web?
- Bulk: about 500M pages, growing at roughly 20M pages/month
- Lack of stability: change estimates range from 1% per week to 1% per day
- Heterogeneity:
  - types of documents: text, pictures, audio, scripts...
  - quality
  - document languages: 100+
- Duplication
- Non-running text
- High linkage: 8 links/page on average
Slide 4: Taxonomy of Web Document Languages
- Metalanguages: SGML, XML
- Languages: HyTime, TEI Lite, HTML, SMIL, MathML, RDF, XHTML
- Style sheets: DSSSL, XSL, CSS
Slide 5: Non-running Text
Slide 6: What's Different about the Web Users?
- They make poor queries:
  - short (2.35 terms on average)
  - imprecise terms
  - suboptimal syntax (80% of queries use no operators)
  - low effort
- Wide variance in:
  - needs
  - expectations
  - knowledge
  - bandwidth
- Specific behavior:
  - 85% look at only one screen of results
  - 78% of queries are never modified
Slide 7: Why Don't Users Get What They Want?
The path from need to results: user need → user request (verbalized) → query to the IR system → results. Translation problems (polysemy, synonymy) arise at each step.
Example:
- User need: "I need to get rid of mice in the basement"
- Verbalized request: "What's the best way to trap mice alive?"
- Query to the IR system: "mouse trap"
- Results: computer supplies, software, etc.
Slide 8: AltaVista results for "mouse trap"
Slide 9: AltaVista results for "mice trap"
Slide 10: Challenges on the Web
- Distributed data
- Dynamic data
- Large volume
- Unstructured and redundant data
- Data quality
- Heterogeneous data
Slide 11: Web IR Advantages
- High linkage
- Interactivity
- Statistics:
  - easy to gather
  - large sample sizes
Slide 12: Evaluation in the Web Context
- Quality of pages varies widely
- Relevance alone is not enough
- We need both relevance and high quality; together they determine the value of a page
Slide 13: Example of Web IR Query Results
Slide 14: How to Compare Web Search Engines?
- Search engines hold huge repositories!
- Search engines hold different resources!
- Solution: precision at top 10, the percentage of the top 10 returned pages that are relevant ("ranking quality")
(Diagram: the set of retrieved resources and the set of relevant resources overlap in RR, the relevant pages that were returned.)
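As a concrete illustration of the measure on this slide, here is a minimal sketch of precision at top 10. The example URLs and the relevance judgments are hypothetical placeholders, not data from the slides.

    def precision_at_k(results, relevant, k=10):
        """Fraction of the top-k returned pages that are judged relevant."""
        top_k = results[:k]
        if not top_k:
            return 0.0
        hits = sum(1 for url in top_k if url in relevant)
        return hits / len(top_k)

    # Hypothetical example: 10 returned URLs, 6 of which were judged relevant.
    returned = [f"http://example.com/page{i}" for i in range(10)]
    relevant_set = {f"http://example.com/page{i}" for i in (0, 1, 2, 4, 6, 9)}
    print(precision_at_k(returned, relevant_set))  # 0.6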
Slide 15: The 'HITS' Scoring Method
- A new method from 1998:
  - improved quality
  - fewer retrieved documents
- Based on the Web's high linkage
- Simplified implementation in Google (www.google.com)
- Advanced implementation in Clever
- Reminder: hypertext is a nonlinear graph structure
Slide 16: 'HITS' Definitions
- Authorities: good sources of content
- Hubs: good sources of links
Slide 17: 'HITS' Intuition
- Authority comes from in-edges; being a hub comes from out-edges.
- Better authority comes from in-edges from good hubs; being a better hub comes from out-edges to good authorities.
Slide 18: 'HITS' Algorithm
(Diagram: a node v with in-edges from w_1, ..., w_k and out-edges to u_1, ..., u_k.)
Repeat until HUB and AUTH converge:
  Normalize HUB and AUTH
  HUB[v] := sum of AUTH[u_i] over all u_i with Edge(v, u_i)
  AUTH[v] := sum of HUB[w_i] over all w_i with Edge(w_i, v)
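To make the update rules above concrete, here is a minimal sketch of the HITS iteration on a small directed graph given as adjacency lists. The toy graph, the fixed iteration count, and the Euclidean normalization are illustrative assumptions, not details taken from the slides.

    import math

    def hits(graph, iterations=20):
        """Run HITS updates on a directed graph given as {node: [nodes it links to]}."""
        nodes = set(graph) | {v for targets in graph.values() for v in targets}
        hub = {v: 1.0 for v in nodes}
        auth = {v: 1.0 for v in nodes}
        for _ in range(iterations):
            # AUTH[v] := sum of HUB[w] over all w with an edge w -> v
            auth = {v: sum(hub[w] for w in nodes if v in graph.get(w, ())) for v in nodes}
            # HUB[v] := sum of AUTH[u] over all u with an edge v -> u
            hub = {v: sum(auth[u] for u in graph.get(v, ())) for v in nodes}
            # Normalize both score vectors
            auth_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
            hub_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
            auth = {v: x / auth_norm for v, x in auth.items()}
            hub = {v: x / hub_norm for v, x in hub.items()}
        return hub, auth

    # Tiny illustrative graph: pages A and B link to C and D; C links to D.
    example = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"]}
    hub_scores, auth_scores = hits(example)
    print(sorted(auth_scores.items(), key=lambda kv: -kv[1]))  # D and C emerge as authorities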
Slide 19: Google Output: "Princess Diana"
Slide 20: Prototype Implementation (Clever)
(Diagram: the root set of documents inside the larger base set.)
1. Select a root set of documents using the index.
2. Add linked documents to form the base set.
3. Iterate over the base set to find hubs and authorities.
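A minimal sketch of steps 1 and 2, expanding a root set into a base set by following links. The link-table representation, the cap on incoming links per page, and the toy data are hypothetical assumptions for illustration, not details of the Clever prototype.

    def build_base_set(root_set, out_links, in_links, max_in_links=50):
        """Expand a root set of pages into a base set by adding linked pages.

        out_links[p]: pages that p links to; in_links[p]: pages linking to p.
        The cap on in-links per page bounds how many citing pages are pulled
        in for very popular documents (an assumed, commonly used safeguard).
        """
        base = set(root_set)
        for page in root_set:
            base.update(out_links.get(page, []))                 # pages the root page points to
            base.update(in_links.get(page, [])[:max_in_links])   # a bounded sample of pages pointing to it
        return base

    # Hypothetical toy link structure.
    out_links = {"p1": ["p2", "p3"], "p2": ["p3"]}
    in_links = {"p1": ["p4"], "p3": ["p1", "p2", "p5"]}
    print(build_base_set({"p1", "p3"}, out_links, in_links))  # {'p1', 'p2', 'p3', 'p4', 'p5'}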
Slide 21: By-products
- Separates Web sites into clusters.
- Reveals the underlying structure of the World Wide Web.