Learning Bit by Bit Search. Information Retrieval Census Memex Sea of Documents Find those related to “new media” Brute force.

Learning Bit by Bit Search

Information Retrieval
– Census
– Memex
– Sea of Documents
– Find those related to “new media”
– Brute force

The Internet Search Engines

Anatomy of a Search Engine
– Crawler
– Indexer
– Query Matcher

Crawling Process
1. Seeding
2. Run:
   a. Check for Exit Criteria
   b. Retrieve Content
   c. Extract URLs
   d. Determine next URL
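The crawling process above can be sketched as a short loop. This is a minimal illustration, not the crawler used in the demo: `FAKE_WEB`, `fetch_links`, and `max_pages` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def fetch_links(url):
    """Stand-in for steps 2b and 2c: retrieve content, then extract URLs."""
    return FAKE_WEB.get(url, [])

def crawl(seeds, max_pages=100):
    frontier = deque(dict.fromkeys(seeds))  # 1. seeding (duplicates dropped)
    seen = set(frontier)
    visited = []
    # 2a. exit criteria: page threshold met, or no more URLs
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()            # 2d. next URL, first in first out
        visited.append(url)
        for link in fetch_links(url):       # 2b + 2c
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.example"]))
```

A time-based exit criterion could be added to the `while` condition in the same way as the page threshold.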

Seeding Initial Set of URLs

Exit Criteria
When to stop:
– Threshold of pages met
– Time period
– No more URLs

Retrieve Content
– Read the page
– Optionally determine if it is relevant

Extract URLs
– Depth check (e.g. stop at a depth of 5)
– Content is parsed for hyperlinks
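A rough sketch of URL extraction with a depth check, using only Python's standard library. The names `LinkExtractor`, `extract_urls`, and `max_depth` are assumptions made for illustration; the slides do not say how the demo crawler parses pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_urls(html, base_url, depth, max_depth=5):
    # Depth check: do not follow links from pages at or beyond max_depth.
    if depth >= max_depth:
        return []
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<p><a href="/arts">Arts</a> <a href="http://moma.org/">MoMA</a></p>'
print(extract_urls(page, "http://nytimes.com/index.html", depth=0))
```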

Invisible webpages
– robots.txt
– sitemaps
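As an illustration of how a crawler might honor robots.txt, the sketch below uses `urllib.robotparser` from Python's standard library. The rules, the user agent name `MyCrawler`, and the example.com URLs are made up for this example; a real crawler would download the file from the site being crawled.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string to stay offline.
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(url, user_agent="MyCrawler"):
    """True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/report.html"))
```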

Determine Next URL

Breadth first
Weighted search:
– Authority
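Breadth-first order simply takes URLs from the frontier in the order they were found. A weighted search can instead pop the highest-scoring URL first. The sketch below assumes a precomputed `authority` score per URL (a hypothetical input; the slides do not say how authority is measured) and uses a heap as the priority queue.

```python
import heapq

def next_urls_by_authority(frontier, authority):
    """Yield frontier URLs from highest to lowest authority score.

    URLs with no known score default to 0.0; ties keep discovery order.
    """
    heap = [(-authority.get(url, 0.0), order, url)
            for order, url in enumerate(frontier)]
    heapq.heapify(heap)
    while heap:
        _, _, url = heapq.heappop(heap)
        yield url

frontier = ["a.example", "b.example", "c.example"]
scores = {"b.example": 0.9, "c.example": 0.5}  # hypothetical scores
print(list(next_urls_by_authority(frontier, scores)))
```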

Demo Simple Crawler on nytimes.com

Indexing Lucene, etc.
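To give a feel for what an indexer builds, here is a toy inverted index in plain Python. It is not Lucene's API, only the underlying idea: map each term to the set of documents containing it, so a query avoids a brute-force scan of every document. The function names and sample documents are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "New media art at MoMA",
    2: "Media coverage of the census",
    3: "New exhibits this week",
}
index = build_index(docs)
print(search(index, "new media"))
```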

Queries
– Avoid spam
– Find the best answer

Link Analysis/Pagerank
– Citations
– Hyperlinks as recommendations

Link Analysis/Pagerank
Weight of a link from page i to page j = links from page i to page j / number of outlinks from page i

from i \ to j    nytimes.com    moma.org
nytimes.com      0              1/3
moma.org         1/10           0

ex. nytimes.com has 3 outlinks, 1 of which is to moma.org; moma.org has 10 outlinks, 1 of which is to nytimes.com
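One way to turn these link weights into page scores is power iteration, sketched below. The damping factor of 0.85, the iteration count, the uniform handling of pages with no outlinks, and the three-page graph are all assumptions for illustration; the slides give only the link weights.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outlink passes on an equal share: 1 / number of outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Page `c` ends up ranked above `b` because it is recommended by both `a` and `b`, while `b` receives only half of `a`'s vote.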

Precision vs. Recall
Recall
– percentage of relevant documents returned by the algorithm
– true positives / (true positives + false negatives)
Precision
– percentage of returned documents the user finds relevant
– true positives / (true positives + false positives)
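A small worked example of both measures, computed from sets of document ids. The function name and the sample ids are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document ids."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # returned = true positives + false positives
    precision = true_positives / len(returned) if returned else 0.0
    # relevant = true positives + false negatives
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returns 4 documents; 2 of the 3 relevant ones are among them.
precision, recall = precision_recall(returned={1, 2, 3, 4}, relevant={2, 4, 6})
print(precision, recall)
```

Here precision is 2/4 and recall is 2/3: returning more documents tends to raise recall while lowering precision.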

Demos
– Spam avoidance with PageRank
– Scaling up search with Nutch
