How Google Works: Connectivity and Link Analysis
Thanks to Ray Mooney & Scott White

Search Engines
- What is connectivity?
- Role of connectivity in ranking
- Academic paper analysis
- HITS (IBM)
- Google
- CiteSeer

HTML Structure & Feature Weighting
- Weight tokens under particular HTML tags more heavily:
  - <TITLE> tokens (Google seems to like title matches)
  - <H1>, <H2>... tokens
  - <META> keyword tokens
- Parse the page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.
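
A minimal Python sketch of per-tag weighting using the standard-library HTML parser; the boost values and the class name are illustrative assumptions, not Google's actual weights:

```python
from html.parser import HTMLParser

# Hypothetical boosts; a real engine would tune these empirically.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "meta": 1.2}

class WeightedTokenizer(HTMLParser):
    """Collects (token, weight) pairs, boosting tokens inside weighted tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.tokens = []      # accumulated (token, weight) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "meta":     # void tag: keywords live in attributes, not text
            for name, value in attrs:
                if name == "content" and value:
                    for tok in value.lower().split():
                        self.tokens.append((tok, TAG_WEIGHTS["meta"]))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest boost among open tags; 1.0 for plain body text.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags),
                     default=1.0)
        for tok in data.lower().split():
            self.tokens.append((tok, weight))

p = WeightedTokenizer()
p.feed("<html><title>Web Search</title><h1>Link Analysis</h1>"
       "<p>plain text</p></html>")
print(p.tokens)  # title/h1 tokens carry higher weights than body text
```

A section-aware parser would extend the same idea by tracking which conceptual block (navigation vs. content) it is currently inside.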

Link Analysis
- What is link analysis?
- For academic documents, CiteSeer is an example of such a search engine.
- Others: SMEALSearch, eBizSearch

Bibliometrics: Citation Analysis
- Many standard documents include bibliographies (or references): explicit citations to other previously published documents.
- Using citations as links, standard corpora can be viewed as a graph.
- The structure of this graph, independent of content, can provide interesting information about the similarity of documents, the structure of information, and the impact of a paper.

Bibliographic Coupling
- Measure of similarity of documents introduced by Kessler in 1963.
- The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e. the size of the intersection of their bibliographies.
- May want to normalize by the size of the bibliographies.

Co-Citation
- An alternative citation-based measure of similarity, introduced by Small in 1973.
- The co-citation of A and B is the number of documents that cite both A and B.
- May want to normalize by the total number of documents citing either A or B.
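
The two measures are mirror images: coupling intersects outgoing bibliographies, while co-citation counts shared incoming citers. A minimal sketch of both over a toy citation graph (the document names and the `cites` mapping are illustrative):

```python
# cites[d] = set of documents that d's bibliography cites.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}

def bibliographic_coupling(a, b):
    """Number of documents cited by both a and b (Kessler, 1963)."""
    return len(cites[a] & cites[b])

def co_citation(a, b):
    """Number of documents whose bibliographies cite both a and b (Small, 1973)."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

print(bibliographic_coupling("A", "B"))  # 2: both cite Y and Z
print(co_citation("A", "B"))             # 2: C and D each cite both
```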

Citations vs. Links
Web links are a bit different from citations:
- Many links are navigational.
- Many pages with high in-degree are portals, not content providers.
- Not all links are endorsements; company websites don't point to their competitors.
- Citations to relevant literature are enforced by peer review.

Authorities
- Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
- In-degree (number of pointers to a page) is one simple measure of authority.
- However, in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?

Hubs
- Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).
- Ex: a course home page that collects links to the topic's key resources.

HITS
- Algorithm developed by Kleinberg in 1998 (IBM search engine project).
- Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
- Based on mutually recursive facts:
  - Hubs point to lots of authorities.
  - Authorities are pointed to by lots of hubs.

Hubs and Authorities
Together they tend to form a bipartite graph: hubs on one side, each linking to many authorities on the other.

HITS Algorithm
- Computes hubs and authorities for a particular topic specified by a normal query.
- First determines a set of relevant pages for the query, called the base set S.
- Analyzes the link structure of the web subgraph defined by S to find authority and hub pages in this set.

Authorities and In-Degree
- Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (they may just be generally popular pages like Yahoo or Amazon).
- True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).

Iterative Algorithm
- Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities.
- Maintain for each page p ∈ S:
  - Authority score: a_p (vector a)
  - Hub score: h_p (vector h)
- Initialize all a_p = h_p = 1.
- On each iteration, update a_p = Σ h_q over all pages q that point to p, then h_p = Σ a_q over all pages q that p points to.
- Maintain normalized scores: rescale a and h to unit length after each iteration.
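
A compact sketch of the iteration, assuming the subgraph is given as an adjacency mapping (the fixed iteration count is illustrative; a real implementation would test for convergence):

```python
import math

def hits(links, iterations=50):
    """Sketch of the HITS iteration over a base-set subgraph.

    `links[p]` is the set of pages p points to. A fixed iteration
    count stands in for a real convergence test.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages pointing to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub of p: sum of authority scores of pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Toy subgraph: h1, h2 act as hubs; a1, a2, a3 as authorities.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2", "a3"}}
auth, hub = hits(links)
print(max(auth, key=auth.get))  # a1 or a2: pointed to by both hubs
```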

HITS Results
- An ambiguous query can result in the principal eigenvector only covering one of the possible meanings.
- Non-principal eigenvectors may contain hubs & authorities for other meanings.
- Example, "jaguar":
  - Atari video game (principal eigenvector)
  - NFL football team (2nd non-principal eigenvector)
  - Automobile (3rd non-principal eigenvector)

Google Background
- "Our main goal is to improve the quality of web search engines."
- Google ← googol = 10^100.
- Originally part of the Stanford digital library project known as WebBase; commercialized in 1999.

Initial Design Goals
- Deliver results that have very high precision, even at the expense of recall.
- Make search engine technology transparent, i.e. advertising shouldn't bias results.
- Bring search engine technology into the academic realm in order to support novel research activities on large web data sets.
- Make the system easy to use for most people; e.g. users shouldn't have to specify more than a couple of words.

Google Search Engine Features
Two main features to increase result precision:
- Uses the link structure of the web (PageRank).
- Uses text surrounding hyperlinks to improve accurate document retrieval.
Other features include:
- Takes into account word proximity in documents.
- Uses font size, word position, etc. to weight words.
- Storage of full raw HTML pages.

PageRank in Words
Intuition: imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally, the surfer will get bored and, instead of following a link pointing outward from the current page, will jump to another random page. At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page.

PageRank
- Link-analysis method used by Google (Brin & Page, 1998).
- Does not attempt to capture the distinction between hubs and authorities: ranks pages just by authority.
- Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query.

Initial PageRank Idea
- Just measuring in-degree (citation count) doesn't account for the authority of the source of a link.
- Initial PageRank equation for page p:
    R(p) = c · Σ_{q → p} R(q) / N_q
  where the sum is over all pages q that link to p, and N_q is the total number of out-links from page q.
- A page q "gives" an equal fraction of its authority to all the pages it points to (e.g. p).
- c is a normalizing constant set so that the rank of all pages always sums to 1.

Initial PageRank Idea (cont.)
Can view it as a process of PageRank "flowing" from pages to the pages they cite.
[Diagram: example graph with PageRank values flowing along the links]

Sample Stable Fixpoint
[Diagram: example graph whose PageRank values (0.2, 0.4, ...) are unchanged by further iteration]

Problem with Initial Idea
- A group of pages that only point to themselves but are pointed to by other pages acts as a "rank sink" and absorbs all the rank in the system.
- Rank flows into the cycle and can't get out.

Rank Source
Introduce a "rank source" E that continually replenishes the rank of each page p by a fixed amount E(p):
    R(p) = c · ( Σ_{q → p} R(q) / N_q + E(p) )

Random Surfer Model
- PageRank can be seen as modeling a "random surfer" that starts on a random page and then at each point:
  - With probability E(p), randomly jumps to page p.
  - Otherwise, randomly follows a link on the current page.
- R(p) models the probability that this random surfer will be on page p at any given time.
- "E jumps" are needed to prevent the random surfer from getting "trapped" in web sinks with no outgoing links.
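
A power-iteration sketch of the random surfer model, assuming the graph fits in memory and (as one common simplification) that pages with no out-links redistribute their rank through the jump vector. Here the constant c is the probability of following a link, so jumps occur with the complementary probability; all names and values are illustrative:

```python
def pagerank(links, e, c=0.85, iterations=50):
    """Sketch of PageRank with an explicit rank source E.

    `links[p]` is the set of pages p points to; `e[p]` is the jump
    distribution (uniform for classic PageRank). c is the probability
    of following a link rather than jumping.
    """
    pages = set(e) | set(links) | {t for ts in links.values() for t in ts}
    rank = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(iterations):
        new = dict.fromkeys(pages, 0.0)
        for q in pages:
            out = links.get(q) or ()
            if out:
                share = c * rank[q] / len(out)   # q gives equal fractions away
                for p in out:
                    new[p] += share
            else:
                for p in pages:                  # dangling page: jump instead
                    new[p] += c * rank[q] * e.get(p, 0.0)
        for p in pages:                          # the rank source E replenishes
            new[p] += (1 - c) * e.get(p, 0.0)
        rank = new
    return rank

links = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
uniform_e = dict.fromkeys(links, 1.0 / 3)
print(pagerank(links, uniform_e))  # the 3-cycle converges to ~1/3 each
```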

Justifications for Using PageRank
- Attempts to model user behavior.
- Captures the notion that the more a page is pointed to by "important" pages, the more it is worth looking at.
- Takes into account the global structure of the web.

Speed of Convergence
- Early experiments on Google used 322 million links.
- The PageRank algorithm converged (within a small tolerance) in about 52 iterations.
- The number of iterations required for convergence is empirically O(log n) (where n is the number of links), so the calculation is quite efficient.

Google Ranking
The complete Google ranking (based on university publications prior to commercialization) includes:
- A vector-space similarity component.
- A keyword proximity component.
- An HTML-tag weight component (e.g. title preference).
- A PageRank component.
Details of current commercial ranking functions are trade secrets.

Personalized PageRank
- PageRank can be biased (personalized) by changing E to a non-uniform distribution.
- Restrict "random jumps" to a set of specified relevant pages.
- For example, let E(p) = 0 except for one's own home page, which receives all of the rank-source mass.
- This results in a bias towards pages that are closer in the web graph to your own home page.
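
Continuing the pagerank() sketch above, personalization is just a non-uniform jump vector:

```python
# All "random jumps" now land on page "A" (standing in for one's own
# home page), biasing rank toward A's neighborhood in the graph.
personal_e = {"A": 1.0, "B": 0.0, "C": 0.0}
print(pagerank(links, personal_e))  # A now outranks B and C
```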

Google PageRank-Biased Crawling
Use PageRank to direct (focus) a crawler on "important" pages:
- Compute PageRank using the current set of crawled pages.
- Order the crawler's search queue based on the current estimated PageRank.

Link Analysis Conclusions
- Link analysis uses information about the structure of the web graph to aid search.
- It is one of the major innovations in web search.
- It is the primary reason for Google's success.

Limits of Link Analysis
- Stability: adding even a small number of nodes/edges to the graph can have a significant impact.
- Topic drift: a top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page.
- Content evolution: adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of PageRanks.

Google Architecture
Implemented in Perl, C and C++ on Solaris and Linux.

Preliminary: Hit Lists
A "hitlist" is defined as the list of occurrences of a particular word in a particular document, including additional meta info:
- position of word in doc
- font size
- capitalization
- descriptor type, e.g. title, anchor, etc.
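
As a sketch, one hit could be modeled as a small record; the field names here are illustrative, and the paper's actual hitlists were hand-packed into two bytes per hit rather than stored as objects:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document (illustrative fields)."""
    position: int      # word position within the document
    font_size: int     # relative font-size bucket
    capitalized: bool
    descriptor: str    # e.g. "plain", "title", "anchor"

# A hitlist: all occurrences of one word in one document.
hitlist = [Hit(4, 2, False, "title"), Hit(118, 1, True, "plain")]
```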

Google Architecture (cont.)
- Crawlers: multiple crawlers run in parallel; each keeps its own DNS lookup cache and ~300 connections open at once.
- URL Server: keeps track of URLs that have been and need to be crawled.
- Store Server: compresses and stores web pages.
- URL Resolver: converts relative URLs into absolute URLs; stores each link and the text surrounding it.
- Indexer: uncompresses and parses documents; stores link information in the anchors file.
- Repository: contains the full HTML of every web page; each document is prefixed by docID, length, and URL.

Google Architecture (cont.)
- URL Resolver: maps absolute URLs into docIDs stored in the Doc Index; stores anchor text in "barrels"; generates the database of links (pairs of docIDs).
- Indexer: parses & distributes hit lists into "barrels."
- Barrels: partially sorted forward indexes, sorted by docID; each barrel stores hitlists for a given range of wordIDs.
- Lexicon: in-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel that each wordID falls into.
- Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.
- Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it just contains the URL.

Google Architecture (cont.)
- Two kinds of barrels: short barrels contain hit lists that include title or anchor hits; long barrels contain all hit lists.
- The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the Searcher. The lexicon stores ~14 million words.
- The new lexicon (keyed by wordID), the inverted doc index (keyed by docID), and the PageRanks are used to answer queries.
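
To make the sorter's forward-to-inverted transformation concrete, a toy sketch; the plain dicts stand in for the on-disk, sorted, compressed barrels:

```python
from collections import defaultdict

# Forward barrel: docID -> {wordID -> hitlist} (toy in-memory stand-in).
forward_barrel = {
    7: {101: ["hit", "hit"], 205: ["hit"]},
    9: {101: ["hit"]},
}

# Invert: wordID -> list of (docID, hitlist), the shape a query needs.
inverted = defaultdict(list)
for doc_id, words in forward_barrel.items():
    for word_id, hits in sorted(words.items()):
        inverted[word_id].append((doc_id, hits))

print(inverted[101])  # [(7, ['hit', 'hit']), (9, ['hit'])]
```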

Google Architecture
Over 50,000 Linux boxes. 6 billion pages and counting.