T. Sharon, A. Frank: Internet Resources Discovery (IRD), Web IR

Presentation transcript:

1 Internet Resources Discovery (IRD): Web IR (T. Sharon, A. Frank)

2 Web IR
• What’s Different about Web IR?
• Web IR Queries
• How to Compare Web Search Engines?
• The ‘HITS’ Scoring Method

3 What’s different about the Web?
• Bulk: 500M; growth at 20M/month
• Lack of stability: estimates range from 1%/day to 1%/week
• Heterogeneity
  – Types of documents: text, pictures, audio, scripts...
  – Quality
  – Document languages: 100+
• Duplication
• Non-running text
• High linkage: ≥ 8 links/page on average

4 Taxonomy of Web Document Languages
• Metalanguages: SGML, HyTime, XML
• Languages: SMIL, MathML, RDF, XHTML, HTML, TEI Lite
• Style sheets: DSSSL, XSL, CSS

5 Non-running Text

6 What’s different about the Web Users?
• Make poor queries
  – short (2.35 terms on average)
  – imprecise terms
  – sub-optimal syntax (80% without operators)
  – low effort
• Wide variance in
  – Needs
  – Expectations
  – Knowledge
  – Bandwidth
• Specific behavior
  – 85% look over one result screen only
  – 78% of queries are not modified

7 Why don’t the Users get what they Want?
User need → User request (verbalized) → Query to IR system → Results
Translation problems: polysemy, synonymy
Example:
  – User need: I need to get rid of mice in the basement
  – User request: What’s the best way to trap mice alive?
  – Query: Mouse trap
  – Results: Computer supplies, software, etc.

8 Alta Vista: Mouse trap

9 Alta Vista: Mice trap

10 Challenges on the Web
• Distributed data
• Dynamic data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data

11 Web IR Advantages
• High Linkage
• Interactivity
• Statistics
  – easy to gather
  – large sample sizes

12 Evaluation in the Web Context
• Quality of pages varies widely
• Relevance is not enough
• We need both relevance and high quality = value of page

13 Example of Web IR Query Results

14 How to Compare Web Search Engines?
• Search engines hold huge repositories!
• Search engines hold different resources!
Solution: Precision at top 10
  – % of top 10 pages that are relevant (“ranking quality”)
[Diagram labels: Resources, Retrieved (Ret), Relevant, RR (Relevant Returned)]
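The precision-at-top-10 measure from this slide can be sketched in a few lines of Python. This is only an illustration: the function name `precision_at_k`, the document identifiers, and the relevance judgments are hypothetical, and the sketch assumes the common convention of dividing by k even when fewer than k results come back.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k ranked results that are judged relevant.

    ranked_results: list of document ids in the order the engine returned them.
    relevant: set of document ids a human judged relevant (assumed to be given).
    """
    top_k = ranked_results[:k]
    relevant_returned = sum(1 for doc in top_k if doc in relevant)
    # Assumption: divide by k, so a short result list is penalized.
    return relevant_returned / k


# Hypothetical top-10 list from one engine and hypothetical judgments.
results = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
judged_relevant = {"d1", "d3", "d4", "d8", "d42"}
print(precision_at_k(results, judged_relevant))
```

Because only the top 10 results of each engine need to be judged, the measure does not require knowing the full repository of either engine, which is what makes it usable for comparing engines that hold different resources.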

15 The ‘HITS’ Scoring Method
• New method from 1998:
  – improved quality
  – reduced number of retrieved documents
• Based on the Web’s high linkage
• Related link-analysis ranking (PageRank) in Google (www.google.com)
• Advanced implementation in Clever
• Reminder: Hypertext is a nonlinear graph structure

16 ‘HITS’ Definitions
• Authorities (A): good sources of content
• Hubs (H): good sources of links

17 ‘HITS’ Intuition
• Authority comes from in-edges.
• Being a hub comes from out-edges.
• Better authority comes from in-edges from hubs.
• Being a better hub comes from out-edges to authorities.

18 ‘HITS’ Algorithm
[Diagram: node v with incoming edges from w1, w2, ..., wk and outgoing edges to u1, u2, ..., uk]
Repeat until HUB and AUTH converge:
  Normalize HUB and AUTH
  HUB[v] := Σ AUTH[u_i] for all u_i with Edge(v, u_i)
  AUTH[v] := Σ HUB[w_i] for all w_i with Edge(w_i, v)
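The update rules on this slide can be tried out with a small Python sketch. It is a minimal illustration under stated assumptions, not the Clever implementation: the function name `hits`, the L2 normalization, the fixed iteration cap, the tolerance, and the toy graph are all choices made here. The sketch also updates AUTH first and then computes HUB from the fresh AUTH values, which is one common ordering.

```python
import math


def hits(edges, max_iterations=50, tol=1e-9):
    """Iterate the HUB/AUTH updates on a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    in_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    def normalize(scores):
        # Assumption: scale so the squared scores sum to 1 (L2 norm).
        norm = math.sqrt(sum(s * s for s in scores.values()))
        if norm == 0:
            return scores
        return {n: s / norm for n, s in scores.items()}

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iterations):
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        new_auth = normalize({v: sum(hub[w] for w in in_links[v]) for v in nodes})
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        new_hub = normalize({v: sum(new_auth[u] for u in out_links[v]) for v in nodes})
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical toy graph: three hub-like pages pointing at two content pages.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub_scores, auth_scores = hits(toy_edges)
print(sorted(auth_scores, key=auth_scores.get, reverse=True))
print(sorted(hub_scores, key=hub_scores.get, reverse=True))
```

In the toy graph, a1 receives the most in-edges from hubs and h1 points at both authorities, so they end up with the top authority and hub scores, which matches the intuition on the previous slide.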

19 Google Output: Princess Diana

20 Prototype Implementation (Clever)
1. Selecting documents using index (root)
2. Adding linked documents
3. Iterating to find hubs and authorities
[Diagram labels: Base, Root]
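Steps 1 and 2 above might look roughly like the following Python sketch. The slide does not say which linked documents are added, so this assumes the root set is expanded with both the pages it links to and the pages that link to it; the cap on in-links per page, the function name `build_base_set`, and the link tables are hypothetical and only for illustration.

```python
def build_base_set(root, out_links, in_links, max_in_links=50):
    """Expand a root set of pages into a base set by following links.

    root: pages selected from the index for the query (step 1).
    out_links / in_links: dicts mapping a page to the pages it links to /
    the pages linking to it (assumed to be available from a crawl).
    max_in_links: assumed cap so very popular pages do not flood the base set.
    """
    base = set(root)
    for page in root:
        base.update(out_links.get(page, []))
        base.update(list(in_links.get(page, []))[:max_in_links])
    return base


def induced_edges(base, out_links):
    """Keep only the links whose source and target are both in the base set."""
    return [(src, dst)
            for src in sorted(base)
            for dst in out_links.get(src, [])
            if dst in base]


# Hypothetical link tables for a tiny crawl.
example_out = {"r1": ["p1", "p2"], "q1": ["r1", "x9"]}
example_in = {"r1": ["q1"]}
base_set = build_base_set({"r1"}, example_out, example_in)
print(sorted(base_set))
print(induced_edges(base_set, example_out))
```

The edge list restricted to the base set is what step 3 would iterate over to find hubs and authorities.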

21 By-products
• Separates Web sites into clusters.
• Reveals the underlying structure of the World Wide Web.

