Web Crawling.

Web Crawling

Focused Crawling, Incremental Crawling

Crawling Lingo

Breadth-First Crawl (BFS: Breadth-First Search)
The frontier is the set of web pages whose links have not yet been explored. It begins with the “seeds”. BFS treats the frontier as a FIFO queue: grab the first page, extract its links, and put them on the end.
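
The FIFO frontier described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `bfs_crawl`, `get_links`, `max_pages`, and the toy link graph are all assumed names introduced here, and an in-memory dictionary stands in for fetching real pages.

```python
from collections import deque


def bfs_crawl(seeds, get_links, max_pages=100):
    """Return pages in the order a breadth-first crawl would visit them."""
    frontier = deque(seeds)   # FIFO queue of pages whose links are unexplored
    seen = set(seeds)         # every page that has ever been on the frontier
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()        # grab the first page
        order.append(page)
        for link in get_links(page):     # extract its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)    # put them on the end
    return order


# Hypothetical toy link graph standing in for the web.
toy_web = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}
order = bfs_crawl(["seed"], lambda page: toy_web.get(page, []))
print(order)
```

The `seen` set keeps a page from being queued twice, which matters as soon as the link graph contains cycles (here, `d` links back to `seed`).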

Best-First: sort the frontier by an estimation criterion
e.g., how many times the topic's terms appear in the document (cosine similarity). Remove the lowest-scoring pages from the frontier. Others have used anchor text so as not to have to open the page. Shark Search uses the anchor text, the text around the anchor, and a score inherited from ancestors.
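
A best-first frontier can be sketched with a priority queue in place of the FIFO queue. The sketch below makes several assumptions that are not in the slides: each unvisited link is scored with the cosine similarity between the topic and the page that linked to it, seeds get a score of 1.0, and text is tokenized by whitespace. All function names and the toy pages are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_first_crawl(seeds, topic, get_text, get_links, max_pages=100):
    """Visit the highest-scoring frontier page first."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-1.0, s) for s in seeds]   # assumption: seeds score 1.0
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        _, page = heapq.heappop(frontier)
        order.append(page)
        # Assumption: children inherit the relevance score of their parent.
        score = cosine_similarity(topic, get_text(page))
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return order


# Hypothetical toy pages: (text, outgoing links).
pages = {
    "seed": ("web directory of pages", ["a", "b"]),
    "a": ("web crawler frontier queue", ["a1"]),
    "b": ("cooking recipes", ["b1"]),
    "a1": ("crawler politeness", []),
    "b1": ("dessert recipes", []),
}
order = best_first_crawl(
    ["seed"], "web crawler frontier",
    get_text=lambda p: pages[p][0],
    get_links=lambda p: pages[p][1],
)
print(order)
```

On this toy graph the crawl follows the on-topic branch (`a`, then `a1`) before returning to `b`, whereas a breadth-first crawl would visit `a` and `b` before either of their children.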

Chakrabarti: Two measures
Relevance of a document to a topic: CLASSIFIER. How beneficial it is to crawl a document: DISTILLER. A very relevant page without links is only a finishing point in the crawl. In contrast, hubs are good for crawling, and good hubs should be checked frequently for new resource links.

Master Category Tree
Root: R_root(p) = 1 for all documents p.
Sum over i of R_ci(p) = R_c(p), where the ci are the children of c.
(Diagram: category tree with nodes Root, A, B, A1, A2.)
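
The two constraints on the category tree can be checked with a small recursive sketch. It assumes that a classifier supplies relevance values for the leaf categories and that each internal category's relevance is the sum over its children. The tree shape (A1 and A2 placed under A), the probabilities, and the function name are illustrative assumptions, not values from the slides.

```python
def relevance(tree, leaf_relevance, node):
    """R_c(p): relevance of document p to category c.

    Leaves take their value from the classifier output; an internal
    node is the sum of its children, so the root comes out as 1 when
    the leaf values form a probability distribution.
    """
    children = tree.get(node, [])
    if not children:
        return leaf_relevance.get(node, 0.0)
    return sum(relevance(tree, leaf_relevance, child) for child in children)


# Assumed tree shape: Root -> A, B and A -> A1, A2.
tree = {"Root": ["A", "B"], "A": ["A1", "A2"]}

# Hypothetical classifier output for one document p (sums to 1).
leaf_relevance = {"A1": 0.5, "A2": 0.25, "B": 0.25}

r_a = relevance(tree, leaf_relevance, "A")
r_root = relevance(tree, leaf_relevance, "Root")
print(r_a, r_root)
```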

Chakrabarti Applied Hubs/Authorities
Authorities should be put into the final set. Authorities might not have any (good) links. What you really want to crawl next are hubs: pages that point to a bunch of authorities.
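
Hub and authority scores of this kind are commonly computed with the HITS iteration. The sketch below is the textbook version of that iteration on a toy graph, offered as an illustration only; it is not claimed to be the exact variant used in Chakrabarti's distiller. The graph, page names, and iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Basic HITS: returns (hub, authority) score dictionaries."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1 and h2 are link pages, x, y, z are resources.
links = {"h1": ["x", "y", "z"], "h2": ["x", "y"], "x": [], "y": [], "z": []}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
print(best_hub, sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph `x`, `y`, and `z` have no outgoing links, so their hub scores are zero: they belong in the final set but give the crawler nowhere to go, while `h1` is the page worth revisiting.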

Chakrabarti, Heuristics
Pages cite pages on related subjects. A page that points to one page with a desired topic is more likely than a random page to point to other pages with desired topics. E.g., in one test, a page that pointed to a given first-level Yahoo topic had a 45% chance of pointing to another.

The Intro to IR slides discuss cosine measures and the process of automatically “generating” documents (as in the Chakrabarti paper).

Category Tree
