ISP 433/633 Week 7 Web IR


Web is a unique collection
  Largest repository of data
  Unedited
  Can be anything
    – Information type
    – Sources
  Changing
    – Growing exponentially
      320 million Web pages [Lawrence & Giles 1998]
      800 million Web pages, 15 TB [Lawrence & Giles 1999]
      3 billion Web pages indexed [Google 2003]

Web serves a unique user base
  Virtually anyone
  No training
  All kinds of information needs

What Do People Search for on the Web? (from Spink et al. study)
  Topics
    – Genealogy/Public figure: 12%
    – Computer related: 12%
    – Business: 12%
    – Entertainment: 8%
    – Medical: 8%
    – Politics & government: 7%
    – News: 7%
    – Hobbies: 6%
    – General info/surfing: 6%
    – Science: 6%
    – Travel: 5%
    – Arts/education/shopping/images: 14%

Web Queries
  Short
    – ~2.4 words on average (Aug 2000)
    – Has increased, was 1.7 (~1997)
  User expectations
    – Many say "the first item shown should be what I want to see"!
    – This works if the user has the most popular/common notion in mind

How to do Web IR?
  Take advantage of hyperlinks
  Social network analysis
    – E.g. small world phenomenon: six degrees of separation
    – Some people are more popular than others
  Citation analysis
    – ISI's Impact Factor = NumOfCitations / NumOfPapers
  The same type of analysis can be applied to Web page linkage
    – Link analysis

Link Analysis
  Assumptions
    – If the pages pointing to this page are good, then this is also a good page.
    – The words on the links pointing to this page are useful indicators of what this page is about.
  Does it work?
    – Apparently, Google uses it

PageRank
  Google's trademarked algorithm (Page et al. 1998)
    – Named after Larry Page, co-founder of Google
  Ranks the importance of a page based on the Web graph
    – 3 billion nodes (pages) and 20 billion edges (links)
  Independent of the query

PageRank Intuition
  A page's rank is determined by the sum of its citing pages' ranks

PageRank calculation
  Assume page A has pages T1...Tn which point to it (i.e., citations).
  The parameter d is a damping factor which can be set between 0 and 1 (usually set to 0.85).
  C(A) is defined as the number of links going out of page A.
  The PageRank of a page A is:
    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  Start with random guesses of PageRanks
  Iteratively compute PageRanks for all pages
  Until the values are stabilized
  The average PageRank of all pages is always 1.0
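
The iteration on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pagerank`, the three-page `toy_links` graph, the tolerance, and the choice to start every page at 1.0 (rather than a random guess, so the run is reproducible) are not from the slides. It also assumes every page has at least one outgoing link and appears as a key in the dictionary.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to.
    Assumes no page has zero outgoing links (hypothetical simplification).
    """
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}  # C(p) in the slide's notation

    # Invert the graph once: for each page, which pages cite it?
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 1.0 for p in pages}  # starting guess
    for _ in range(max_iter):
        new_pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[p])
            for p in pages
        }
        delta = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if delta < tol:  # values have stabilized
            break
    return pr


# Hypothetical toy graph: A links to B and C, B links to C, C links back to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
average = sum(ranks.values()) / len(ranks)
```

On this toy graph C ends up ranked highest because both A and B cite it, and `average` stays at 1.0, matching the last bullet of the slide.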

PageRank
  PageRank calculator:
  Use this knowledge to enhance site ranking in Google
    – Structure your site links to improve the main page's PageRank

Anchors
  Words on the links
    – Often an accurate description of the page
    – Helpful for non-text-based information
  Assign high term-document weight to anchors
    – Google does this
  Abuse
    – Google bombing
    – Try "miserable failure" with Google

HITS
  Query-dependent model (Kleinberg 97)
  Hubs
    – Pages that have many outgoing links
  Authorities
    – Pages that have many links pointing to them
  Interconnected
    – A positive two-way feedback
    – Can be used to calculate each other

HITS
  Algorithm:
    – Obtain root set using input query (via regular search engine)
    – Expand the root set by radius one
    – Run iterations on the hub and authority scores together
    – Report top-ranking authorities and hubs
  Can find relevant authorities that do not even contain the original query words
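
The "run iterations on the hub and authority scores together" step might look like the sketch below. It assumes the root set has already been retrieved and expanded into a small link graph; the function name `hits`, the `base_set` data, the fixed iteration count, and the Euclidean normalisation are illustrative choices, not details given in the slides.

```python
import math


def hits(graph, iterations=50):
    """Mutually reinforce hub and authority scores on a small link graph.

    `graph` maps each page to the pages it links to (the expanded root set).
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}

        # Normalise so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    return hub, auth


# Hypothetical base set: three hub-like pages pointing at two candidate authorities.
base_set = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub_scores, auth_scores = hits(base_set)
top_authority = max(auth_scores, key=auth_scores.get)
```

Here `a1` comes out as the top authority because all three hubs point to it, and `h1` and `h2` outscore `h3` as hubs because they point to both authorities.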

Subject-specific popularity
  Similar to the HITS idea
  Without prior query
  Ranks a site based on the number of same-subject pages that reference it
    – Clustering sites into communities

Other Useful Information
  Directories and categories
    – E.g. Yahoo
  Capitalization, font, title, etc.
    – E.g. Google uses this information
  "Click popularity": number of clicks on the site
  "Stickiness": time spent on the site

Web Search Architecture
  Preprocessing
    – Collection gathering phase: Web crawling
    – Collection indexing phase
  Online
    – Query servers

Standard Web Search Engine Architecture
  [Figure: architecture diagram. Labels: crawler; crawl the web; eliminate duplicates; create an inverted index; inverted index; search engine servers; user query; DocIds; show results]
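
As a rough illustration of the inverted index at the centre of this architecture, the sketch below maps each term to the set of DocIds that contain it and answers a query by intersecting those sets. The function names, the whitespace tokenisation, the AND-only query semantics, and the three sample documents are all assumptions made for the example; a production index would also store positions and weights.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of DocIds whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenisation
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted DocIds that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical mini collection keyed by DocId.
docs = {
    1: "web search engines crawl the web",
    2: "link analysis ranks web pages",
    3: "search engines build an inverted index",
}
index = build_inverted_index(docs)
web_search_hits = search(index, "web search")
engine_hits = search(index, "search engines")
```

With these documents, `web_search_hits` contains only DocId 1, while `engine_hits` contains DocIds 1 and 3.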

Google Architecture

Google Indexing Data Structure
  A hit is an occurrence of a term in a document
  Each forward barrel holds a range of wordIDs
  Short barrels for fancy (title, big font) and anchor hits

Google Query Evaluation

Google Statistics (1998)

Web Crawlers
  Main idea:
    – Start with known sites
    – Record information for these sites
    – Follow the links from each site
    – Record information found at new sites
    – Repeat
  Page visit order
    – Breadth-first search
    – Depth-first search
    – Best-first search (e.g. using PageRank)
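
The crawl loop above can be sketched without touching the network by standing in a small dictionary for the Web. Everything named here (`crawl`, `get_links`, `max_pages`, `toy_web`) is hypothetical; a real crawler would fetch and parse pages where this sketch calls `get_links`.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth first, starting from the known seed sites.

    `get_links(url)` returns the links found on a page; here it is a stand-in
    for fetching and parsing the page.
    """
    frontier = deque(seeds)
    seen = set(seeds)      # avoids revisiting pages and looping forever
    visited = []           # order in which pages were "recorded"

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()   # FIFO queue gives breadth-first order
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web; note the cycle from c back to a.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = crawl(["a"], lambda url: toy_web.get(url, []))
```

Replacing `popleft()` with `pop()` would turn the same loop into a depth-first crawl, and replacing the deque with a priority queue keyed on a score such as PageRank would give best-first order.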

Crawling Web Issues
  Keep-out signs
    – A file called robots.txt tells the crawler which directories are off limits
  Freshness
    – Figure out which pages change often, then crawl these often
  Duplicates, virtual hosts, etc.
    – Convert page contents with a hash function
  Lots of problems
    – Server unavailable
    – Incorrect HTML
    – Missing links
    – Infinite loops
  Web crawling is difficult to do robustly
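
Two of these issues lend themselves to a short sketch using only the Python standard library: checking robots.txt rules and detecting duplicate pages by hashing their contents. The helper names, the `ExampleBot` user agent, the sample robots.txt text, and the whitespace-and-case normalisation before hashing are assumptions for illustration; the slides do not say which hash or normalisation is used.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="ExampleBot"):
    """Build a function that says whether a URL may be crawled.

    `robots_txt` is the text of a robots.txt file that was already downloaded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


def content_fingerprint(html):
    """Hash page contents so exact or near-identical copies compare equal.

    Collapsing whitespace and lowercasing is a crude, assumed normalisation.
    """
    normalised = " ".join(html.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Hypothetical robots.txt that puts one directory off limits for all crawlers.
allowed = make_robots_checker("User-agent: *\nDisallow: /private/\n")
can_crawl_home = allowed("http://example.com/index.html")
can_crawl_private = allowed("http://example.com/private/page.html")

# Two copies of the same page that differ only in case and spacing.
same_page = content_fingerprint("<p>Hi  there</p>") == content_fingerprint("<P>hi there</P>\n")
```

A crawler could keep the fingerprints it has seen in a set and skip any page whose fingerprint is already present, which also covers the same content served from several virtual hosts.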