1
Learning Bit by Bit Search
2
Information Retrieval
– Census
– Memex
– Sea of documents
– Find those related to “new media”
– Brute force
3
The Internet Search Engines
4
Anatomy of a Search Engine
– Crawler
– Indexer
– Query Matcher
5
Crawling Process
1. Seeding
2. Run:
   a. Check for Exit Criteria
   b. Retrieve Content
   c. Extract URLs
   d. Determine next URL
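The steps above can be sketched as a single loop. This is a minimal illustration, not the crawler from the demo: `fetch` and `extract_urls` are hypothetical callables supplied by the caller, and `FAKE_WEB` is an invented in-memory stand-in for real pages so the sketch runs without network access.

```python
from collections import deque


def crawl(seeds, fetch, extract_urls, max_pages=100):
    """Run the seed / check / retrieve / extract / next-URL loop."""
    frontier = deque(seeds)          # 1. Seeding
    seen = set(seeds)
    visited = []
    # 2a. Exit criteria: page threshold met or no more URLs
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()     # 2d. Next URL (first in, first out)
        content = fetch(url)         # 2b. Retrieve content
        visited.append(url)
        for link in extract_urls(content):   # 2c. Extract URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page "web": each page maps to the links it contains.
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

order = crawl(
    ["a"],
    fetch=lambda url: FAKE_WEB.get(url, []),
    extract_urls=lambda links: links,
    max_pages=3,
)
print(order)  # ['a', 'b', 'c']
```

Passing `fetch` and `extract_urls` in as arguments keeps the loop independent of how pages are downloaded or parsed.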
6
Seeding Initial Set of URLs
7
Exit Criteria
When to stop:
– Threshold of pages met
– Time period
– No more URLs
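The three stopping conditions can be sketched as one check that a crawler calls at the top of each iteration. The function name `should_stop`, its parameters, and the default limits are assumptions made for illustration.

```python
import time


def should_stop(pages_fetched, frontier, started_at,
                max_pages=1000, max_seconds=3600.0, now=None):
    """Return the reason to stop crawling, or None to keep going."""
    # `now` can be injected for testing; otherwise read the monotonic clock.
    now = time.monotonic() if now is None else now
    if pages_fetched >= max_pages:
        return "threshold of pages met"
    if now - started_at >= max_seconds:
        return "time period elapsed"
    if not frontier:
        return "no more URLs"
    return None


print(should_stop(5, ["http://example.com/"], started_at=0.0,
                  max_pages=5, now=1.0))   # threshold of pages met
```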
8
Retrieve Content
– Read the page
– Optionally determine if it is relevant
9
Extract URLs
– Depth check (e.g., 5)
– Content is parsed for hyperlinks
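A minimal sketch of this step using Python's standard `html.parser` and `urllib.parse`. It assumes the depth check means "do not follow links from pages that are already 5 hops from a seed". The names `LinkParser`, `extract_urls`, and `max_depth` are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_urls(html, base_url, depth, max_depth=5):
    """Return (url, depth) pairs for the links found on a page."""
    if depth >= max_depth:       # depth check: stop expanding at the limit
        return []
    parser = LinkParser(base_url)
    parser.feed(html)
    return [(link, depth + 1) for link in parser.links]


page = '<p><a href="/arts">Arts</a> <a href="http://moma.org/">MoMA</a></p>'
found = extract_urls(page, "http://nytimes.com/index.html", depth=0)
print(found)
# [('http://nytimes.com/arts', 1), ('http://moma.org/', 1)]
```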
10
Invisible webpages
– robots.txt
– sitemaps
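As an illustration of how robots.txt keeps pages out of a crawl, the sketch below uses the standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the user agent name `ExampleCrawler` are all made up for the example. A real crawler would load the file from the site instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

allowed = rules.can_fetch("ExampleCrawler", "http://example.com/index.html")
blocked = rules.can_fetch("ExampleCrawler", "http://example.com/private/data.html")
print(allowed, blocked)  # True False
```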
11
Determine Next URL
12
Breadth first
Weighted search:
– Authority
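Breadth-first order is a plain first-in, first-out queue. A weighted search can be sketched by swapping that queue for a priority queue keyed on an authority score. The class name `WeightedFrontier` and the scores below are assumptions; the slides do not say how authority is computed.

```python
import heapq
import itertools


class WeightedFrontier:
    """Pop the URL with the highest authority score first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker: insertion order

    def push(self, url, authority):
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-authority, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = WeightedFrontier()
frontier.push("http://example.com/low", 0.2)
frontier.push("http://example.com/high", 0.9)
frontier.push("http://example.com/mid", 0.5)
picked = [frontier.pop() for _ in range(len(frontier))]
print(picked)
# ['http://example.com/high', 'http://example.com/mid', 'http://example.com/low']
```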
13
Demo Simple Crawler on nytimes.com
14
Indexing Lucene, etc.
15
Queries
– Avoid spam
– Find the best answer
16
Link Analysis/PageRank
– Citations
– Hyperlinks as recommendations
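To make "hyperlinks as recommendations" concrete, here is a small power-iteration sketch of PageRank. The three-page graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration, and the function assumes every page has at least one outlink.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration; `graph` maps each page to the pages it links to.

    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its inlinks.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in graph.items():
            # A page splits its current rank evenly among its outlinks.
            share = ranks[page] / len(outlinks)
            for target in outlinks:
                new_ranks[target] += damping * share
        ranks = new_ranks
    return ranks


# Hypothetical graph: a links to b and c, b links to c, c links to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking)  # ['c', 'a', 'b']
```

Page c ranks highest because both other pages link to it, and page a comes second because it receives all of c's rank.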
17
Link Analysis/PageRank

from i \ to j    nytimes.com    moma.org
nytimes.com      0              1/3
moma.org         1/10           0

Entry (i, j) = links to page j / number of outlinks from page i

Example:
– nytimes.com has 3 outlinks, 1 of which is to moma.org
– moma.org has 10 outlinks, 1 of which is to nytimes.com
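The two-page example can be reproduced with exact fractions. The dictionary layout and the function name `link_weight` are illustrative choices; the outlink counts are the ones given on the slide.

```python
from fractions import Fraction

# Total outlinks per page, as stated in the example.
outlink_counts = {"nytimes.com": 3, "moma.org": 10}

# (i, j) -> number of links from page i to page j.
links = {
    ("nytimes.com", "moma.org"): 1,
    ("moma.org", "nytimes.com"): 1,
}


def link_weight(i, j):
    """Links from page i to page j, divided by page i's outlink count."""
    return Fraction(links.get((i, j), 0), outlink_counts[i])


for i in outlink_counts:
    print(i, [str(link_weight(i, j)) for j in outlink_counts])
# nytimes.com ['0', '1/3']
# moma.org ['1/10', '0']
```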
18
Precision vs. Recall
Recall
– Percentage of relevant documents returned by the algorithm
– True positives / (true positives + false negatives)
Precision
– Percentage of returned documents the user finds relevant
– True positives / (true positives + false positives)
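Both measures can be computed from two sets: the documents the engine returned and the documents that are actually relevant. The document IDs below are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    true_pos = len(returned & relevant)    # returned and relevant
    false_pos = len(returned - relevant)   # returned but not relevant
    false_neg = len(relevant - returned)   # relevant but not returned
    precision = true_pos / (true_pos + false_pos) if returned else 0.0
    recall = true_pos / (true_pos + false_neg) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three are truly relevant.
precision, recall = precision_recall(
    returned={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(precision, round(recall, 3))  # 0.5 0.667
```

Here two of the four returned documents are relevant (precision 1/2), and two of the three relevant documents were returned (recall 2/3).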
19
Demos
– Spam avoidance with PageRank
– Scaling up search with Nutch