Internet Systems Review. Generally Speaking Understand the essence of the papers/systems we’ve studied. Understand taxonomies/criteria for comparison.


Internet Systems Review

Generally Speaking: Understand the essence of the papers/systems we’ve studied. Understand taxonomies/criteria for comparison. Terminology. Closed books/notes.

Papers: Kleinberg; Google; Ferguson, Google vs. Microsoft; Cho, Rich get Richer; Pitkow; Lieberman; Nelson; Berners-Lee

Systems: Google, HITS, Outride, Direct Hit, Letizia, Powerscout, Watson, Margin Notes, Xanadu, Webtop/Open Search

Search Evaluation Precision and Recall. Relevance: consensus relevance, author relevance, topic-specific relevance. Evaluations provided in papers: Google, HITS, Cho, Outride. TREC (Text Retrieval Conference): standard testbeds for search evaluation.

Precision and Recall What is your precision and recall if: You have a repository of a million documents, and you need to find out about government subsidies for llama farming. Of those million documents, twenty are relevant to your needs. You do a search and the first page of your result list contains sixteen documents. Of those sixteen, ten are among those relevant to llama subsidies.

Precision and Recall, Answer Recall is 10/20, or 50%. Precision is 10/16, or 62.5%.
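
The arithmetic in the answer above can be checked with a short sketch. The function name `precision_recall` and its parameter names are illustrative assumptions, not something from the slides.

```python
def precision_recall(num_retrieved, num_relevant_retrieved, num_relevant_total):
    """Return (precision, recall) as fractions between 0 and 1."""
    # Precision: what share of the retrieved documents are relevant.
    precision = num_relevant_retrieved / num_retrieved
    # Recall: what share of all relevant documents were retrieved.
    recall = num_relevant_retrieved / num_relevant_total
    return precision, recall


# Llama-subsidy example: 16 results shown, 10 of them relevant, 20 relevant overall.
precision, recall = precision_recall(16, 10, 20)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Note that the repository size (a million documents) does not enter either measure.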

Hubs and Authorities Hub: a page that points to many authorities. Authority: a page that is pointed to by many hubs. What current system uses this concept for “subject-specific” ranking?

HITS 1. Get an initial result list using traditional IR. 2. Add ins/outs to the set. 3. Run the iterative algorithm, computing a hub and an authority score for each page on each iteration.
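
The iterative step can be sketched as follows. This is a minimal sketch under stated assumptions: the three-page graph is hypothetical (it is not the table from the exercise), scores are left unnormalized, and each iteration updates authority scores first and then computes hub scores from the new authority scores. Other orderings and normalization schemes exist.

```python
def hits(links, iterations=2):
    """links maps each page to the list of pages it points at.

    Returns (hub, auth) dictionaries of unnormalized scores.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}   # initial score of 1 for every page
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing at p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of the pages p points at.
        new_hub = {p: sum(new_auth[dst] for dst in links.get(p, [])) for p in pages}
        auth, hub = new_auth, new_hub
    return hub, auth


# Hypothetical graph: a points at b and c, b points at c, c points nowhere.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
hub, auth = hits(graph, iterations=2)
print("hub:", hub)
print("auth:", auth)
```

In practice the scores are usually normalized after each iteration so they do not grow without bound.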

HITS – Hubs and Authorities Consider the following link graph table. An x in the row labeled d1 means d1 points at that page, e.g., d1 points at d2 and d4. Suppose after the initial text-based search and after adding ins and outs, we were left with the seven documents in the table. Compute the hub and authority scores of the seven documents, given an initial score of 1 for each. You need not normalize any scores and you need run through only two iterations.
Columns: d1 d2 d3 d4 d5 d6 d7
d1: x x
d2: x x x
d3: x
d4: x x
d5: x x
d6: x
d7: x x

HITS vs. Page Rank How could the concept of hubs/authorities improve on Page Rank?

Important in general vs. authority for a specific topic (figure labels: Generally important; Authority for topic A; Hubs for topic A)

What are the disadvantages of HITS relative to Page Rank? Potential topic drift. TF not counted in ranking (but only documents containing the terms are used). Run-time delay.

Page Rank PR(p) = (1-d) + d (PR(in_1)/outDegree(in_1) + PR(in_2)/outDegree(in_2) + … ) where p is the page for which you are computing page rank, d is a damping factor, and in_i is the ith page pointing at page p. Explain the heuristics on which this formula is based.

Heuristics in Page Rank A popular page is one pointed to by lots of popular pages. If a page links to a bunch of other pages including p, p gets less credit. Based on the random surfer model. See ex.html for more info on how page rank works.
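
The Page Rank formula can be evaluated by repeated substitution. The sketch below rests on assumptions not taken from the slides: the four-link graph is hypothetical, d = 0.85 is just a commonly quoted damping value, every page is assumed to have at least one outgoing link, and a fixed number of iterations stands in for a convergence test.

```python
def page_rank(links, d=0.85, iterations=20):
    """links maps each page to the list of pages it points at.

    Applies PR(p) = (1-d) + d * sum(PR(q) / outDegree(q)) over pages q linking to p.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    pr = {p: 1.0 for p in pages}  # start every page at 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Each in-linking page q splits its rank evenly over its out-links.
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links.get(q, []))
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = page_rank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Page c ends up ahead of page b because it receives all of b's rank plus half of a's, while b receives only half of a's.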

Easy Question: With the random surfer model, is the user randomly visiting pages?

Inverted Index word → hit – hit – hit; word2 → hit – hit – hit – hit; … Each hit: plain/fancy, docid, position in document. If two keywords are input to a search, how are results computed?
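
One way to picture the structure is a map from each word to its list of hits. In this minimal sketch the documents, the function names `build_index` and `search_all`, and the choice to answer a two-keyword query by intersecting docid sets are all illustrative assumptions. A hit here carries only docid and position; the plain/fancy distinction is left out.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps docid -> text. Returns word -> list of (docid, position) hits."""
    index = defaultdict(list)
    for docid, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((docid, position))
    return index


def search_all(index, *words):
    """Return the docids that contain every query word."""
    doc_sets = [{docid for docid, _ in index.get(w.lower(), [])} for w in words]
    return set.intersection(*doc_sets) if doc_sets else set()


# Hypothetical three-document collection.
docs = {
    1: "llama farming subsidies",
    2: "llama wool prices",
    3: "government subsidies for llama farming",
}
index = build_index(docs)
result = search_all(index, "llama", "subsidies")
print(sorted(result))
```

A real engine would then rank the matching documents, for example by hit positions, hit types, and link-based scores.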

Anchors Google associates anchor text with both the page containing the link and the page it points to. Reason 1: Anchors often provide more accurate descriptions of the pointed-to page. Reason 2: Anchors provide text for images, programs, etc.

Building an inverted index Suppose the following two documents were crawled by a search engine that built an inverted index similar to Google's. Show the inverted index that would be built. hello world Nothing big bad world

Sample inverted index hello – doc1; world – doc1 – doc2; Nothing – doc1 – doc2; big – doc2; bad – doc2

Pages without keywords Describe how Page Rank and HITS allow pages that don’t contain keywords to be discovered as results. Does this help recall or precision? Both? What else is it helpful for?

Cho: The Rich get Richer Search-dominant model. Users rarely look at any but the top results. New, quality pages have difficulty breaking in. When popularity does increase, it’s quite sudden.

Personalization and Contextual Computing Outride, Letizia, Powerscout, Watson, Margin Notes, Google. What contextual information is used? How is it applied? Transparency. Obtrusiveness. Privacy.

What contextual information is used? User profile(s): data explicitly input by user, browsing history, usage statistics (click popularity, stickiness), bookmarks, documents. Currently open documents. Collaborative filtering.

How is context applied? Query augmentation and automated query creation (automated information queries often using TFIDF). Result processing. Limiting the search space. Notifying user of previous searches. Eurekster.
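
Automated query creation with TFIDF can be illustrated by picking the highest-scoring terms from the document the user has open. This is only a sketch under assumptions: the function name `tfidf_query`, the tiny corpus, and the smoothed idf formula are illustrative and are not taken from any of the systems listed.

```python
import math
from collections import Counter


def tfidf_query(document, corpus, k=3):
    """Return the k terms of `document` with the highest tf * idf score."""
    term_counts = Counter(document.lower().split())
    n_docs = len(corpus)
    scores = {}
    for term, tf in term_counts.items():
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for other in corpus if term in other.lower().split())
        # Smoothed idf (an assumed variant) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical background corpus and currently open document.
corpus = ["the llama eats grass", "the farm has a barn", "the subsidy was paid"]
open_document = "the llama farm llama subsidy"
query_terms = tfidf_query(open_document, corpus, k=3)
print(query_terms)
```

The common word "the" scores lowest because it appears in every corpus document, so it is dropped from the generated query.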

Limiting Search Space Domain-specific libraries: explicit user choice (webtop), automated two-phase (webtop++). Neighborhood of current page (Letizia). Seen/Haven’t seen (Outride).

Contextual Computing Issues Identifying context switching, changing interests: task model, multiple profiles. Transparency: does the user know what the system is doing? User-agent collaboration (e.g., Google Personal). Obtrusiveness: especially for automated information queries, but also consider complexity of search. Efficiency (Pitkow stressed this). Privacy.

Metasearch API-based as opposed to scraping. Exploits advantage of subsets of web. Role a standard API could play: dynamic list of information sources; independence of sources/metasearch.

Search in the World Index everything: phone conversations, pdf data. Hidden web. The role of APIs: separating presentation and data. Economic benefit? Standards.

Search Results Clustering Tree/Graph view: see TouchGraph

Personal Information Management Associative trails (Bush): entity associations NOT made by author and NOT embedded in either entity. del.icio.us is shared bookmarks (King): bookmark = url – assoc – comment. Semantic web generalizes (Berners-Lee): thing – assoc – thing.

Personal Information Management “Document” is the wrong granularity; blogs are sending us this way. Document as a list of content pointers (Nelson). Versioning and permanence: global address space (Nelson, Berners-Lee, Archive). Deep 2-way links: can get to the full context of content. Structured over unstructured data.