1 Searching the Web
Representation and Management of Data on the Internet
2 Goal
To better understand Web search engines:
– Fundamental concepts
– Main challenges
– Design issues
– Implementation techniques and algorithms
3 What does it do?
A search engine:
– Processes user queries
– Finds pages with related information
– Returns a list of resources
Is it really that simple? Is creating a search engine much more difficult than Ex1 + Ex2?
4 Motivation
The Web:
– Is used by millions
– Contains lots of information
– Is link based
– Is incoherent
– Changes rapidly
– Is distributed
Traditional information retrieval was built with the exact opposite in mind.
5 The Web's Characteristics
Size
– Over a billion pages available (Google is a spelling of googol = 10^100)
– 5-10K per page => tens of terabytes
– Size doubles every 2 years
Change
– 23% of pages change daily
– About half of the pages do not exist after 10 days
– Bowtie structure
6 Bowtie Structure
– Core: strongly connected component (28%)
– Pages reachable from the core (22%)
– Pages that can reach the core (22%)
7 Search Engine Components
– User Interface
– Crawler
– Indexer
– Ranker
8 HTML Forms on One Foot
9 HTML Forms
Search engines usually use an HTML form. How are forms defined?
10 HTML Behind the Form
The form behind the search box looks roughly like this:
<form method="GET" action="http://search.dbi.com/search">
  Search For: <input type="text" name="query">
  <input type="submit" value="Search">
  <input type="reset" value="Clear">
</form>
The <form> tag defines an HTML form that:
– uses the HTTP method GET (you could use POST instead)
– will send the form's data to http://search.dbi.com/search
11 HTML Behind the Form
<input type="text" name="query"> defines a text box. The attribute name="query" defines the parameter "query", which will get the value of this text box when the form is submitted.
12 HTML Behind the Form
<input type="submit" value="Search"> defines the submit button, labeled "Search". When this button is pressed, an HTTP request of the following form is generated:
GET http://search.dbi.com/search?query=encode(text_box) HTTP/1.1
If additional parameters were defined, they would be added to the URL, with the & sign separating the parameters.
13 Example
Typing "bananas apples" in the text box and pressing Search generates the request
http://search.dbi.com/search?query=bananas+apples
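The same encoding can be reproduced with Python's standard library; a small illustrative sketch, not part of the slides:

from urllib.parse import urlencode

params = {"query": "bananas apples"}
url = "http://search.dbi.com/search?" + urlencode(params)
print(url)  # http://search.dbi.com/search?query=bananas+apples

Note how urlencode replaces the space with a +, exactly as in the example above.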
14 POST Versus GET
Suppose the form were instead defined with the line
<form method="POST" action="http://search.dbi.com/search">
Then pressing submit would cause a POST HTTP request to be sent. The values of the parameters would be sent in the body of the request, instead of as part of the URL.
15 HTML Behind the Form
<input type="reset" value="Clear"> defines the reset button, labeled "Clear". Pressing it clears the form.
16 Crawling the Web
17 Basic Crawler (Spider)
A crawler finds web pages to download into a search engine cache. It maintains a queue of pages and repeatedly calls removeBestPage( ), downloads that page, calls findLinksInPage( ) on it, and calls insertIntoQueue( ) on the links it finds.
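A rough sketch of that loop in Python (the helpers score, download_page, and extract_links are hypothetical placeholders, not the course's actual code):

import heapq

def crawl(seed_urls, max_pages, score, download_page, extract_links):
    # Priority queue ordered by importance; heapq is a min-heap, so scores are negated.
    queue = [(-score(url), url) for url in seed_urls]
    heapq.heapify(queue)
    seen = set(seed_urls)
    cache = {}
    while queue and len(cache) < max_pages:
        _, url = heapq.heappop(queue)                        # removeBestPage( )
        page = download_page(url)                            # download into the cache
        cache[url] = page
        for link in extract_links(page):                     # findLinksInPage( )
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score(link), link))  # insertIntoQueue( )
    return cache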
18 Choosing Pages to Download
Q: Which pages should be downloaded?
A: It is usually not possible to download all pages because of space limitations. Try to get the most important pages.
Q: When is a page important?
A: Use a metric: by interest, by popularity, by location, or a combination of these.
19 Interest Driven
Suppose that there is a query Q that contains the words we are interested in. Define the importance of a page P by its textual similarity to Q.
Example: TF-IDF(P, Q) = sum over w in Q of TF(P, w) / DF(w)
Problem: We must decide whether a page is important while crawling; however, we don't know DF until the crawl is complete.
Solution: Use an estimate. This is what you are using in Ex2!
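A sketch of what this estimate might look like in code, where df_estimate (a running table standing in for the unknown true document frequencies) is an assumption:

from collections import Counter

def importance(page_words, query_words, df_estimate, default_df=1):
    # TF-IDF(P, Q) = sum over w in Q of TF(P, w) / DF(w),
    # with DF replaced by an estimate, since the true DF is only
    # known once the crawl is complete.
    tf = Counter(page_words)
    return sum(tf[w] / df_estimate.get(w, default_df) for w in query_words)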
20 Popularity Driven
The importance of a page P is proportional to the number of pages with a link to P. This is also called the number of back links of P. As before, we need to estimate this amount.
There is a more sophisticated metric, called PageRank (will be taught later in the course).
21 Location Driven
The importance of P is a function of its URL.
Examples:
– Words appearing in the URL (e.g., com)
– Number of "/" in the URL
Easily evaluated; requires no data from previous crawls.
Note: We can also use a combination of all three metrics.
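A toy version of such a location metric (the particular features and weights here are made up for illustration):

from urllib.parse import urlparse

def location_score(url):
    parsed = urlparse(url)
    score = 0.0
    if parsed.hostname and parsed.hostname.endswith(".com"):
        score += 1.0                        # reward a word appearing in the URL, e.g. "com"
    score -= 0.1 * parsed.path.count("/")   # penalize URLs with many "/" (deeply nested pages)
    return score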
22 Refreshing Web Pages
Pages that have been downloaded must be refreshed periodically.
Q: Which pages should be refreshed?
Q: How often should we refresh a page?
In Ex2, you never refresh pages.
23 Freshness Metric
A cached page is fresh if it is identical to the version on the web.
Suppose that S is a set of pages (i.e., a cache).
Freshness(S) = (number of fresh pages in S) / (number of pages in S)
24 Age Metric
The age of a page is the number of days since it was refreshed.
Suppose that S is a set of pages (i.e., a cache).
Age(S) = average age of the pages in S
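Both metrics as a small sketch over a cache held in a dict (the cache layout used here is an assumption):

from datetime import datetime

def freshness(cache):
    # cache maps url -> {"fresh": bool, "refreshed": datetime}
    return sum(1 for page in cache.values() if page["fresh"]) / len(cache)

def average_age(cache, now=None):
    now = now or datetime.now()
    return sum((now - page["refreshed"]).days for page in cache.values()) / len(cache)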
25 Refresh Goal
Goal: Minimize the age of a cache and maximize the freshness of a cache.
Crawlers can refresh only a certain number of pages in a period of time, and this page-download resource can be allocated in many ways. We need a refresh strategy.
26 Refresh Strategies
Uniform Refresh: The crawler revisits all pages with the same frequency, regardless of how often they change.
Proportional Refresh: The crawler revisits a page with frequency proportional to the page's change rate (i.e., if it changes more often, we visit it more often).
Which do you think is better?
27 Trick Question
A two-page database: e1 changes daily, e2 changes once a week. We can visit one page per week. How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
28 Proportional Often Not Good!
Visiting the fast-changing e1 gains 1/2 day of freshness. Visiting the slow-changing e2 gains 1/2 week of freshness. Visiting e2 is a better deal!
29 Another Example
The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day.
Simplified change model:
– The day is split into 9 equal intervals: e1 changes once in each interval, and e2 changes once during the day
– We don't know when the pages change within the intervals
The crawler can download one page a day. Our goal is to maximize the freshness.
30 Which Page Do We Refresh?
Suppose we refresh e2 at midday. If e2 changes in the first half of the day, it remains fresh for the rest (half) of the day.
– 50% chance of a 0.5 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 days
31 Which Page Do We Refresh?
Suppose we refresh e1 at midday. If e1 changes in the first half of its interval, and we refresh at midday (which is the middle of the interval), it remains fresh for the remaining half of the interval = 1/18 of a day.
– 50% chance of a 1/18 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 days
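The arithmetic behind these two slides, as a quick sketch:

def expected_gain(change_interval_days):
    # Refreshing at the middle of a change interval: with probability 1/2 the page has
    # already changed in the first half, and the refresh keeps it fresh for the
    # remaining half of the interval; otherwise the refresh gains nothing.
    return 0.5 * (change_interval_days / 2)

print(expected_gain(1.0))      # e2, changes once a day:    0.25 days
print(expected_gain(1.0 / 9))  # e1, changes 9 times a day: ~0.028 days = 1/36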
32 Not Every Page is Equal!
Suppose that e1 is accessed twice as often as e2. Then it is twice as important to us that e1 is fresh as it is that e2 is fresh.
33 Politeness Issues
When a crawler crawls a site, it uses the site's resources:
– the web server needs to find the file in its file system
– the web server needs to send the file over the network
If a crawler asks for many pages at a high speed, it may
– crash the site's web server, or
– be banned from the site
Solution: Ask for pages "slowly".
34 Politeness Issues (cont)
A site may identify pages that it doesn't want to be crawled. A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite).
A file called robots.txt, placed in the site's root directory, identifies the pages that should not be crawled (e.g., http://www.cnn.com/robots.txt).
35 robots.txt
Use the User-Agent field to identify the programs whose access should be restricted.
Use the Disallow field to identify the pages that should be restricted.
An example is shown below.
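For instance, a robots.txt might contain (an illustrative example, not the slide's original one):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

A polite crawler can check these restrictions with Python's standard urllib.robotparser (a small sketch):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.cnn.com/robots.txt")
rp.read()  # download and parse the site's robots.txt
# True only if this user agent is allowed to fetch the given page
print(rp.can_fetch("MyCrawler", "http://www.cnn.com/some/page.html"))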
36 Other Issues
Suppose that a search engine uses several crawlers at the same time (in parallel). How can we make sure that they are not doing the same work?
37 Index Repository
38 Storage Challenges
– Scalability: should be able to store huge amounts of data (data spans disks or computers)
– Dual access mode: random access (find specific pages) and streaming access (find large subsets of pages)
– Large batch updates: reclaim old space, avoid access/update conflicts
– Obsolete pages: remove pages no longer on the web (how do we find these pages?)
39 Update Strategies
Updates are generated by the crawler. Strategies differ in several characteristics:
– the time at which the crawl occurs and the repository receives the information
– whether the crawl's information replaces the entire database or modifies only parts of it
40 Batch Crawler vs. Steady Crawler
Batch mode
– Periodically executed
– Allocated a certain amount of time
Steady mode
– Runs all the time
– Continuously sends results back to the repository
41 Partial vs. Complete Crawls
A batch-mode crawler can do either
– a complete crawl every run, replacing the entire cache, or
– a partial crawl, replacing only a subset of the cache
The repository can implement
– In-place update: replace the data in the cache, thus quickly refreshing pages
– Shadowing: create a new index with the updates and later replace the previous one, thus avoiding refresh-access conflicts
42 Partial vs. Complete Crawls
Shadowing resolves the conflicts between updates and reads for queries. Batch mode fits well with shadowing; a steady crawler fits well with in-place updates.
43 Types of Indices
– Content index: allows us to easily find pages with certain words
– Links index: allows us to easily find links between pages
– Utility index: allows us to easily find pages in a certain domain, of a certain type, etc.
Q: What do we need these for?
44 Is the Content Index From Ex1 Good?
In Ex1, most of you had a table with the columns Word | Frequency | UrlId.
We want to quickly find pages with a specific word. Is this a good way of storing a content index?
45 Is the Content Index From Ex1 Good? NO
– If a word appears in a thousand documents, then the word will be in a thousand rows. Why waste the space?
– If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents.
– It does not easily support queries that require multiple words.
46 Inverted Keyword Index
evil: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
butterfly: (4, 22)
A hashtable with words as keys and lists of matching documents as values; the lists are sorted by urlId.
47 Query: "evil saddam war"
evil: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
Answers: 5, 17
Algorithm: keep a pointer into each list and always advance the pointer(s) with the lowest urlId.
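A sketch of this merge-style intersection over the sorted posting lists (an illustration of the pointer-advancing idea, not the course's code):

def intersect(postings):
    # postings: one sorted list of urlIds per query word
    pointers = [0] * len(postings)
    answers = []
    while all(p < len(lst) for p, lst in zip(pointers, postings)):
        current = [lst[p] for p, lst in zip(pointers, postings)]
        if len(set(current)) == 1:      # all pointers agree on a urlId: a match
            answers.append(current[0])
            pointers = [p + 1 for p in pointers]
        else:                           # advance the pointer(s) with the lowest urlId
            lowest = min(current)
            pointers = [p + 1 if c == lowest else p for p, c in zip(pointers, current)]
    return answers

print(intersect([[1, 5, 11, 17], [3, 5, 11, 17], [3, 5, 17, 28]]))  # [5, 17]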
48 Challenges
Index building must be:
– Fast
– Economical
Incremental indexing must be supported.
There is a tradeoff when using compression: memory is saved, but time is lost compressing and decompressing.
49 How Do We Distribute the Indices Between Files?
Local inverted file
– Each file holds the index for a disjoint (random) subset of the pages
– A query is broadcast to all files
– The result is the merge of the individual answers
Global inverted file
– Each file is responsible for a subset of the terms in the collection
– A query is "sent" only to the appropriate files
50 Ranking
51 Faults of Traditional Ranking (e.g., TF-IDF)
– Many pages containing a term may be of poor quality or not relevant
– People put popular words in irrelevant sites to promote the site
– Queries are short, so merely containing the words from a query does not indicate importance
52 Additional Factors for Ranking
– Links: if an important page links to P, then P must be important
– Words on links: if a page links to P with the query keyword in the link text, then P must really be about the keyword
– Style of words: if a keyword appears in P in a title, a header, or a large font size, it is more important
Ranking will be taught later in the semester.