Internet Systems Review. Generally Speaking Understand the essence of the papers/systems we’ve studied. Understand taxonomies/criteria for comparison.

1 Internet Systems Review

2 Generally Speaking Understand the essence of the papers/systems we’ve studied. Understand taxonomies/criteria for comparison. Terminology Closed books/notes

3 Papers Kleinberg Google Ferguson, Google vs. Microsoft Cho – Rich get Richer Pitkow Lieberman Nelson Berners-Lee

4 Systems Google HITS Outride Direct Hit Letizia Powerscout Watson Margin Notes Xanadu Webtop/Open Search

5 Search Evaluation Precision and Recall Relevance consensus relevance author relevance topic-specific relevance Evaluations provided in papers Google HITS Cho Outride TREC – Text Retrieval Conference Standard testbeds for search evaluation

6 Precision and Recall What is your precision and recall if: You have a repository of a million documents, and you need to find out about government subsidies for llama farming. Of those million documents, twenty are relevant to your needs. You do a search and the first page of your result list contains sixteen documents. Of those sixteen, ten are among those relevant to llama subsidies.

7 Precision and Recall, Answer Recall is 10/20, or 50%. Precision is 10/16, or 62.5%.
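
The arithmetic in the answer above can be checked with a short sketch. The function name and parameter names are illustrative only; the counts are the ones from the llama-subsidy question.

```python
def precision_recall(retrieved, relevant_total, relevant_retrieved):
    """Return (precision, recall) computed from raw document counts."""
    # Precision: fraction of the returned documents that are relevant.
    precision = relevant_retrieved / retrieved
    # Recall: fraction of all relevant documents that were returned.
    recall = relevant_retrieved / relevant_total
    return precision, recall


precision, recall = precision_recall(
    retrieved=16, relevant_total=20, relevant_retrieved=10
)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints: precision = 0.625, recall = 0.500
```

Note that the size of the repository (a million documents) does not enter either measure.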

8 Hubs and Authorities Hub: A page that points to many authorities. Authority: A page that is pointed to by many hubs. What current system uses this concept for “subject-specific” ranking?

9 HITS Get initial result list using traditional IR Add ins/outs to set Run iterative algorithm, computing hub and authority score for each page on each iteration.
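
The iterative step can be sketched roughly as follows. This is a minimal illustration, not Kleinberg's exact procedure: it assumes every score starts at 1, skips normalization, and updates authorities before hubs within each iteration (other orderings are possible). The four-page graph (a, b, c, d) is made up for the example.

```python
def hits(links, iterations=2):
    """links maps each page to the list of pages it points at.

    Returns (hub, authority) score dictionaries, unnormalized.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing at it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of the (new) authority scores of pages it points at.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, []))
            for p in pages
        }
        hub, auth = new_hub, new_auth
    return hub, auth


# Hypothetical link graph: a points at b and c; b and d point at c.
graph = {"a": ["b", "c"], "b": ["c"], "d": ["c"]}
hub, auth = hits(graph, iterations=2)
print(auth["c"], hub["a"])
# prints: 10.0 14.0
```

Here c ends up as the strongest authority (three pages point at it) and a as the strongest hub (it points at both pages that have any authority).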

10 HITS – Hubs and Authorities Consider the following link graph table. An x in the row labeled d1 means d1 points at that page, e.g., d1 points at d2 and d4. Suppose that after the initial text-based search and after adding ins and outs, we were left with the seven documents in the table. Compute the Hub and Authority scores of the seven documents, given an initial score of 1 for each. You need not normalize any scores and you need run through only two iterations. d1d2d3d4d5d6d7 d1xx d2xxx d3x d4xx d5xx d6x d7xx

11 HITS vs. Page Rank How could the concept of hubs/authorities improve on page rank?

12 Important in general vs. Authority for a specific topic Generally important Authority for topic A Hubs for topic A

13 What are disadvantages of HITS relative to Page Rank? Potential Topic Drift TF not counted in Ranking But only documents with terms used. Run-Time Delay

14 Page Rank PR(p) = (1 - d) + d × (PR(in_1)/outDegree(in_1) + PR(in_2)/outDegree(in_2) + …), where p is the page for which you are computing page rank, d is a damping factor, and in_i is the i-th page pointing at page p. Explain the heuristics on which this formula is based.
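
One way to see the formula at work is to iterate it on a tiny graph. The sketch below is an illustration under assumptions, not Google's implementation: the three-page graph, the value d = 0.85, the starting rank of 1, and the fixed iteration count are all choices made for the example.

```python
def page_rank(links, d=0.85, iterations=20):
    """links maps each page to the list of pages it points at.

    Repeatedly applies PR(p) = (1 - d) + d * sum(PR(q) / outDegree(q))
    over every page q that points at p.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Each in-linking page q splits its rank evenly over its out-links.
            incoming = sum(
                pr[q] / len(links[q])
                for q in pages
                if p in links.get(q, [])
            )
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: a points at b and c, b points at c, c points at a.
ranks = page_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph c outranks b: both receive half of a's rank, but c also receives all of b's, which illustrates the "less credit when a page links to many others" heuristic.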

15 Heuristics in Page Rank A popular page is one pointed to by lots of popular pages. If a page links to a bunch of other pages including p, p gets less credit. Random surfer model basis. See ex.html for more info on how page rank works.

16 Easy Question With the Random Surfer model, is the user randomly visiting pages?

17 Inverted Index word: hit – hit – hit word2: hit – hit – hit – hit … Each hit: plain/fancy, docid, position in document. If two keywords are input to a search, how are results computed?
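
As a rough sketch of the two-keyword question: look up each word's hit list and keep the documents that appear in both. The code below is a simplification that stores only (docid, position) per hit and leaves out the plain/fancy distinction and any ranking step; the three documents and the function names are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps docid -> text. Returns word -> list of (docid, position) hits."""
    index = defaultdict(list)
    for docid, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((docid, position))
    return index


def search_all(index, *keywords):
    """Return the sorted docids that contain every keyword (an AND query)."""
    doc_sets = [
        {docid for docid, _ in index.get(keyword.lower(), [])}
        for keyword in keywords
    ]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))


# Hypothetical three-document collection.
docs = {
    "doc1": "web search engines",
    "doc2": "search the hidden web",
    "doc3": "llama farming",
}
index = build_index(docs)
print(search_all(index, "web", "search"))
# prints: ['doc1', 'doc2']
```

Because positions are kept in each hit, a later step could favor documents where the two keywords occur close together.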

18 Anchors Google associates text in anchor with page and page pointed to. Reason 1: Anchors often provide more accurate descriptions of pointed to page. Reason 2: Anchors provide text for images, programs, etc.

19 Building an inverse index Suppose the following two documents were crawled by a search engine that built an inverse index similar to that of Google's. Show the inverse index that would be built. hello world Nothing big bad world

20 Sample inverse index hello – doc1 world – doc1 – doc2 Nothing – doc1 – doc2 big – doc2 bad – doc2

21 Pages without keywords Describe how Page Rank and HITS allow pages that don’t contain keywords to be discovered as results. Does this help recall or precision? Both? What else is it helpful for?

22 Cho: The Rich get Richer Search-dominant model Users rarely look at any but top results New, quality pages have difficulty breaking in. When popularity does increase, it's quite sudden.

23 Personalization and Contextual Computing Outride Letizia Powerscout Watson Margin Notes Google What contextual information used How is it applied? Transparency Obtrusiveness Privacy

24 What contextual information is used? User Profile(s) data explicitly input by user browsing history usage statistics click popularity, stickiness bookmarks documents Currently Open Documents Collaborative filtering

25 How is context applied Query Augmentation and automated query creation (automated information queries often using TFIDF) Result Processing Limiting the Search Space Notifying user of previous searches Eurekster
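
Automated query creation with TFIDF can be pictured as picking the highest-weighted terms from the document the user currently has open. The sketch below is an assumption-laden illustration and not how any of the systems above works in detail: it uses a smoothed IDF, whitespace tokenization, no stop-word removal, and a made-up three-document background corpus.

```python
import math
from collections import Counter


def top_tfidf_terms(current_doc, corpus, k=3):
    """Return the k terms of current_doc with the highest TF-IDF weight."""
    words = current_doc.lower().split()
    term_counts = Counter(words)
    n_docs = len(corpus)
    corpus_words = [set(doc.lower().split()) for doc in corpus]

    scores = {}
    for word, count in term_counts.items():
        # Document frequency: how many background documents contain the word.
        df = sum(1 for doc_words in corpus_words if word in doc_words)
        # Smoothed IDF so that unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical open document and background corpus.
open_document = "llama farming subsidies llama government"
background = [
    "government news today",
    "llama farming guide",
    "government budget report",
]
print(top_tfidf_terms(open_document, background, k=2))
# prints: ['llama', 'subsidies']
```

The returned terms would then be submitted as a query without the user typing anything, which is where the obtrusiveness and transparency questions come in.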

26 Limiting Search Space Domain-specific libraries explicit user choice (webtop) automated two-phase (webtop++) Neighborhood of current page (Letizia) Seen/Haven’t seen (Outride)

27 Contextual Computing Issues Identifying context switching, changing interests Task model Multiple profiles Transparency Does the user know what the system is doing? User-Agent collaboration (e.g., Google Personal) Obtrusiveness Especially for automated information queries, but also consider complexity of search. Efficiency (Pitkow stressed this) Privacy

28 Metasearch API based as opposed to Scraping Exploits advantage of subsets of web Role a Standard API could play dynamic list of information sources Independence of sources/metasearch

29 Search in the World Index Everything phone conversations, email, pdf data Hidden web The Role of APIs Separating presentation and data. Economic benefit? Standards

30 Search Results Clustering Tree/Graph view see TouchGraph

31 Personal Information Management Associative Trails (Bush) Entity Associations NOT made by author and NOT embedded in either entity, e.g., shared bookmarks (King) bookmark = url – assoc – comment Semantic web generalizes (Berners-Lee) thing – assoc – thing

32 Personal Information Management “Document” wrong granularity Blogs sending us this way Document as a list of content pointers (Nelson) Versioning and Permanence global address space (Nelson, Berners-Lee, Archive) Deep 2-way links Can get to the full context of content Structured over unstructured data

