
1 ISP 433/633 Week 7 Web IR

2 Web is a unique collection
Largest repository of data
Unedited
Can be anything
–Information type
–Sources
Changing
Growing exponentially
–320 million Web pages [Lawrence & Giles 1998]
–800 million Web pages, 15 TB [Lawrence & Giles 1999]
–3 billion Web pages indexed [Google 2003]

3 The Web serves a unique user base
Virtually anyone
No training
All kinds of information needs

4 What Do People Search for on the Web? (from the Spink et al. 1998 study)
Topics
–Genealogy/public figure: 12%
–Computer related: 12%
–Business: 12%
–Entertainment: 8%
–Medical: 8%
–Politics & government: 7%
–News: 7%
–Hobbies: 6%
–General info/surfing: 6%
–Science: 6%
–Travel: 5%
–Arts/education/shopping/images: 14%

5 Web Queries
Short
–~2.4 words on average (Aug 2000)
–Up from ~1.7 words around 1997
User expectations
–Many say "the first item shown should be what I want to see"!
–This works only if the user has the most popular/common notion in mind

6 How to Do Web IR? Take Advantage of Hyperlinks
Social network analysis
–E.g., the small-world phenomenon: six degrees of separation
–Some people are more popular than others
Citation analysis
–ISI's Impact Factor = NumOfCitations / NumOfPapers
The same type of analysis can be applied to Web page linkage
–Link analysis
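As a toy illustration of the impact-factor formula on this slide, here is a minimal Python calculation; the counts are invented for illustration, not real journal data.

```python
# Toy impact-factor calculation mirroring the slide's formula.
# The counts below are made up, not real journal data.
num_citations = 450          # citations received by the journal's papers
num_papers = 150             # papers the journal published
impact_factor = num_citations / num_papers
print(impact_factor)         # 3.0
```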

7 Link Analysis
Assumptions
–If the pages pointing to this page are good, then this is also a good page
–The words on the links pointing to this page are useful indicators of what this page is about
Does it work?
–Apparently: Google uses it

8 PageRank
Google's trademarked algorithm (Page et al. 1998)
–Named after Larry Page, co-founder of Google
Ranks the importance of a page based on the Web graph
–3 billion nodes (pages) and 20 billion edges (links)
Independent of the query

9 PageRank Intuition
A page's rank is determined by the sum of its citing pages' ranks, with each citing page's rank divided among the pages it links to

10 PageRank Calculation
Assume page A has pages T1...Tn which point to it (i.e., citations). The parameter d is a damping factor that can be set between 0 and 1 (usually 0.85). C(A) is defined as the number of links going out of page A. The PageRank of page A is:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
–Start with random guesses for the PageRanks
–Iteratively recompute the PageRanks of all pages
–Stop when the values stabilize
The average PageRank over all pages is always 1.0
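A minimal sketch of the iteration described above, run on a tiny hypothetical three-page graph; the graph, the fixed iteration count, and the uniform starting guess are assumptions for illustration, not part of the slides.

```python
# Minimal iterative PageRank sketch on a toy graph (d = 0.85).
# `graph` maps each page to the list of pages it links to (hypothetical).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

d = 0.85
pr = {page: 1.0 for page in graph}        # uniform starting guess

for _ in range(50):                       # iterate until values stabilize
    new_pr = {}
    for page in graph:
        # Sum PR(T)/C(T) over every page T that points to this page
        incoming = sum(pr[t] / len(graph[t])
                       for t in graph if page in graph[t])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)  # with no dangling pages, the average PageRank stays 1.0
```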

11 PageRank
PageRank calculator: http://webworkshop.net/pagerank_calculator.php3
Use this knowledge to enhance a site's ranking in Google
–Structure your site's links to improve the main page's PageRank
–http://www.iprcom.com/papers/pagerank/

12 Anchors
Words on the links
–Often an accurate description of the target page
–Helpful for non-text-based information
Assign a high term-document weight to anchor words
–Google does this
Abuse
–Google bombing: try "miserable failure" in Google
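One way to picture the "high term-document weight for anchors" idea is a scoring function that counts anchor-text occurrences more heavily than body-text occurrences; the weights and the document layout below are invented for illustration, not Google's actual scheme.

```python
# Sketch: weight anchor-text hits more heavily than body-text hits.
# The weight values and the doc structure are illustrative assumptions.
ANCHOR_WEIGHT = 3.0
BODY_WEIGHT = 1.0

def term_weight(term: str, doc: dict) -> float:
    # doc holds token lists for body text and incoming anchor text
    return (BODY_WEIGHT * doc["body"].count(term)
            + ANCHOR_WEIGHT * doc["anchors"].count(term))

doc = {"body": ["web", "search"], "anchors": ["failure", "failure"]}
print(term_weight("failure", doc))  # 6.0: anchor hits dominate the score
```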

13 HITS
Query-dependent model (Kleinberg 1997)
Hubs
–Pages that have many outgoing links
Authorities
–Pages that have many links pointing to them
Interconnected
–A positive two-way feedback loop
–Each score can be used to calculate the other

14 HITS Algorithm
–Obtain the root set using the input query (via a regular search engine)
–Expand the root set by radius one
–Run iterations on the hub and authority scores together
–Report the top-ranking authorities and hubs
Can find relevant authorities that do not even contain the original query words
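A minimal sketch of the mutual hub/authority update on a hypothetical expanded root set; the graph, the fixed iteration count, and the normalization choice are assumptions made for illustration.

```python
import math

# Toy link graph over a hypothetical expanded root set:
# each page maps to the pages it links to.
graph = {
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}

hub = {p: 1.0 for p in graph}
auth = {p: 1.0 for p in graph}

for _ in range(20):
    # Authority score: sum of hub scores of the pages pointing in
    auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
    # Hub score: sum of authority scores of the pages pointed to
    hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
    # Normalize so the two-way feedback does not blow up
    for scores in (auth, hub):
        norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print(sorted(auth, key=auth.get, reverse=True))  # top authorities first
```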

15 Subject-Specific Popularity
Similar to the HITS idea, but without a prior query
Ranks a site based on the number of same-subject pages that reference it
–Clusters sites into communities
http://www.teoma.com/

16 Other Useful Information
Directories and categories
–E.g., Yahoo
Capitalization, font, title, etc.
–E.g., Google uses this information
"Click popularity": the number of clicks on the site
"Stickiness": the time spent on the site

17 Web Search Architecture
Preprocessing
–Collection gathering phase: Web crawling
–Collection indexing phase
Online
–Query servers

18 Standard Web Search Engine Architecture
–A crawler crawls the Web and eliminates duplicate pages
–An indexer creates an inverted index
–Search engine servers evaluate the user query against the inverted index, retrieve matching docIds, and show results
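To make the "create an inverted index" step concrete, here is a minimal sketch mapping terms to docId posting lists and answering a query by intersecting them; the toy documents and the AND-only query semantics are simplifying assumptions.

```python
from collections import defaultdict

# Toy documents standing in for crawled pages (docId -> text).
docs = {
    1: "web search engines crawl the web",
    2: "crawlers follow links across the web",
}

# Build the inverted index: term -> list of docIds containing it.
index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):     # one posting per (term, doc) pair
        index[term].append(doc_id)

def search(query: str) -> set:
    """Evaluate a query by intersecting the terms' posting lists."""
    postings = [set(index.get(term, [])) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("web crawl"))  # -> {1}
```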

19 Google Architecture

20 Google Indexing Data Structure
A hit is an occurrence of a term in a document
Each forward barrel holds a range of wordIDs
Short barrels are for fancy hits (title, big font) and anchor hits
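As an illustration of how a hit can be stored compactly, the sketch below packs a plain hit into 16 bits with a capitalization flag, a font-size field, and a word position; the field widths loosely follow the 1998 Google paper, but this is an illustrative sketch, not Google's actual code.

```python
# Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
# 12 position bits (field widths loosely follow the 1998 Google paper).
def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit: int) -> tuple:
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_hit(True, 3, 42)
print(unpack_hit(hit))  # (True, 3, 42)
```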

21 Google Query Evaluation

22 Google Statistics (1998)

23 Web Crawlers
Main idea
–Start with known sites
–Record information for these sites
–Follow the links from each site
–Record information found at new sites
–Repeat
Page visit order
–Breadth-first search
–Depth-first search
–Best-first search (e.g., using PageRank)
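A minimal breadth-first crawler sketch matching the loop above; the seed URL, the page limit, and the regex-based link extraction are simplifying assumptions (a production crawler would also need politeness delays, robots.txt checks, and real HTML parsing).

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)            # start with known sites
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()           # breadth-first visit order
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # server unavailable, bad page, etc.
        fetched += 1
        yield url, html                    # record information for this site
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)  # follow the links from the site
            if absolute not in seen:       # avoid revisits and infinite loops
                seen.add(absolute)
                frontier.append(absolute)

for page_url, page_html in crawl(["https://example.com/"], max_pages=10):
    print(page_url, len(page_html))
```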

24 Crawling the Web: Issues
Keep-out signs
–A file called robots.txt tells the crawler which directories are off limits
Freshness
–Figure out which pages change often, then crawl those often
Duplicates, virtual hosts, etc.
–Hash the page contents to detect duplicates
Lots of problems
–Server unavailable
–Incorrect HTML
–Missing links
–Infinite loops
Web crawling is difficult to do robustly
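Two of these issues map directly onto Python's standard library: honoring the robots.txt "keep out signs" with urllib.robotparser, and hashing page contents to spot duplicates. The URLs and the user-agent string below are placeholders.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Keep-out signs: consult robots.txt before fetching (placeholder URLs).
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyCrawler", "https://example.com/private/")

# Duplicate detection: hash page contents and remember the digests.
seen_hashes = set()

def is_duplicate(page_contents: str) -> bool:
    digest = hashlib.sha256(page_contents.encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```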

