Slide 1: Searching the Web
The web is vast: information is scattered around and changes fast, and anyone can publish on the web. Two issues web users have to deal with often are:
–How to locate information on the web?
–What is the quality of the information located?

Slide 2: Searching the Web
The web is different from traditional information sources:
–Size: about a billion pages.
–Average page size: 5-10 KB.
–Textual data: tens of terabytes.
–Around 2000, the size of the web was doubling every 2 years.
–40% of the pages in the .com domain change daily.
–In 10 days, half the pages are gone.
Traditional information retrieval methods cannot be used.

Slide 3: Introduction
Three main approaches to searching for information on the web have evolved: directories, search engines, and meta-search engines.
–Directories organize information on the web into a hierarchy of topics and subtopics.
–Search engines allow users to submit a query and use it to search their databases.
–Meta-search engines submit a query to more than one search engine.

Slide 4: Web Directories
Hyperlinks to web pages organized as a hierarchy of topics and sub-topics. Directories can be general or specialized, and they are very easy to use: users do not need to know exactly what they are searching for. Specialized directories are built by experts in the subject.

Slide 5: Search Engines
Search engines are computer programs that do the following:
–Accept a query from the user.
–Search their database to match the query.
–Collect and return the URLs of pages containing information that matches the query.
–Permit the user to revise and resubmit a query.
Search engines can be general or specialized.

Slide 6: Meta-Search Engines
A meta-search engine calls more than one search engine to do the searching, e.g. www.metasearch.com and www.dogpile.com. Search results are collated into one list or presented separately. Advantage: many engines can be reached with a single query. Disadvantage: many uninteresting pages are returned.
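
To make the fan-out and collation concrete, here is a minimal Python sketch of a meta-search dispatcher. The two engine functions are hypothetical stand-ins; real engines each expose their own query interfaces.

```python
# Hypothetical meta-search sketch: fan a query out to several
# engines in parallel, then collate the results into one list.
from concurrent.futures import ThreadPoolExecutor

def engine_a(query):   # hypothetical backend, stands in for a real engine
    return [f"http://a.example/page?q={query}", "http://shared.example/"]

def engine_b(query):   # hypothetical backend
    return ["http://shared.example/", f"http://b.example/doc?q={query}"]

def meta_search(query, engines=(engine_a, engine_b)):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    seen, collated = set(), []
    for results in result_lists:      # collate into one list,
        for url in results:           # dropping duplicate URLs
            if url not in seen:
                seen.add(url)
                collated.append(url)
    return collated

print(meta_search("ice hockey"))
```

Returning `result_lists` per engine instead of merging them gives the "presented separately" variant mentioned above.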

Slide 7: Querying a Search Engine
Pattern-matching query: a keyword or a group of keywords → the engine returns the URLs of pages containing these words. Example: ice hockey vs. the phrase "ice hockey".
Common words like a, an, the, of are ignored; written as +a, +an, etc. they are not. Compare "University of New Hampshire" and "men's ice hockey".
Other query features: stemming and wildcards.
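
A toy parser makes these query conventions concrete. The stop-word list and the rules below are illustrative simplifications, not any particular engine's behavior.

```python
# Toy query parser: quoted phrases are kept whole, stop words are
# dropped, and a leading + forces a word to be kept.
import re

STOP_WORDS = {"a", "an", "the", "of"}

def parse_query(query):
    phrases = re.findall(r'"([^"]+)"', query)   # quoted phrases kept whole
    rest = re.sub(r'"[^"]+"', " ", query)       # remove phrases from the rest
    terms = []
    for tok in rest.split():
        if tok.startswith("+"):                 # +word: keep even a stop word
            terms.append(tok[1:].lower())
        elif tok.lower() not in STOP_WORDS:
            terms.append(tok.lower())
    return {"terms": terms, "phrases": phrases}

print(parse_query('ice hockey "University of New Hampshire" +the'))
# {'terms': ['ice', 'hockey', 'the'], 'phrases': ['University of New Hampshire']}
```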

Slide 8: Search Engine: Working
Search engines perform the following basic tasks:
–They search the internet based on important words.
–They keep an index of the words they find and where they find them.
–They allow users to look for words or combinations of words found in the index.
Engines index billions of pages and respond to tens of millions of user queries per day. Before the web, programs like Gopher and Archie kept indexes of files stored on servers.

Slide 9: Search Engine: Working
A search engine consists of the following components:
–User interface: where users type in a query and search results are displayed.
–Searcher: searches the database.
–Page ranker: assigns relevancy scores to the information retrieved from the database.

Slide 10: Search Engine: Working
The search engine's database is built with the following components:
–Gatherer (also called spider, worm, or crawler): traverses the web to collect information.
–Indexer: classifies the data gathered by the gatherer and creates an index.

Slide 11: [diagram slide; no text transcribed]

Slide 12: Gatherer or Spider
Multiple spiders (3, 4, or more) browse the web, downloading pages into the page repository. For example, a very early version of Google, using 4 spiders, would crawl 100 pages per second, generating 600 KB/sec. Spiders start with a set of URLs to visit and download pages from; they extract the URLs in the downloaded pages and pass them on to a control module.

Slide 13: Gatherer or Spider
The control module determines which URLs the spider should visit next, using breadth-first or depth-first search. Sometimes a website does not want a spider to access and index its pages → this is indicated in its meta tags. The spider's task is never complete…spiders go on crawling.
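
A minimal breadth-first crawler sketch, assuming pages are plain HTML reachable over HTTP; the crude regex link extraction stands in for a real HTML parser.

```python
# Minimal breadth-first crawler sketch. A FIFO frontier gives
# breadth-first order; popping from the right of the deque
# instead would give depth-first order.
import re
from collections import deque
from urllib.request import urlopen

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_urls(html):
    # Crude href extraction; a real spider would use an HTML parser.
    return re.findall(r'href="(http[^"]+)"', html)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)            # the control module's queue
    seen = set(seed_urls)
    repository = {}                        # page repository: url -> html
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            page = fetch(url)
        except OSError:
            continue                       # skip unreachable pages
        repository[url] = page
        for link in extract_urls(page):    # pass new URLs to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# Usage: repository = crawl(["http://example.com/"])
```

A real spider would also honor the site exclusion rules (robots meta tags) mentioned above and spread its requests out to limit load on any one site.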

Slide 14: Gatherer or Spider
There are a number of issues to be considered:
–Which pages to crawl and download?
–Which pages to refresh after downloading them, and at what frequency?
–How should the load on a website be minimized?
–How should the crawling process be parallelized?

Slide 15: Indexer
The pages collected must be indexed. The simplest way to index is to store pairs (word, URLs where the word was found). The problem with this approach: there is no way to tell how the word was used on the page, importantly or trivially, just once or many times → ranking pages becomes difficult. In practice, more information is stored with each word, e.g. the number of times it occurs and a weight depending on where it was found (a word in the title is given higher weight).
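
A sketch of that simplest (word, URLs) index, to show both how little it takes to build and how little it records:

```python
# The simplest index: word -> set of URLs. As the slide notes,
# this records *where* a word occurs but nothing about how it was
# used, which makes ranking hard.
from collections import defaultdict

def build_simple_index(pages):
    """pages: dict mapping url -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "http://example.com/a": "ice hockey scores",
    "http://example.com/b": "hockey equipment",
}
index = build_simple_index(pages)
print(index["hockey"])   # both URLs; no hint of which page matters more
```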

Slide 16: Indexer
Indexers note the words on the page and where the words were found. Some ignore common words like a, an, the (Google); some do not (AltaVista). Some keep track of the words in the title, subheadings, and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text.

Slide 17: Indexer
Words occurring in the title, subtitles, meta tags, and other important positions are given special consideration.

Slide 18: Indexer
Indexers build at least (i) a text index and (ii) a structure index or link index, keeping in view the key problem: "to find pages most relevant to the query."
Text index (or inverted index): (index term, sorted list of locations).
Location: page id + location on page + other information about the occurrence (e.g. font, heading, anchor text → the payload).
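
A sketch of such a text index, with a made-up payload carrying only an occurrence weight (title terms weighted above body terms):

```python
# Inverted index with richer postings: each term maps to a sorted
# list of locations, where a location records the page id, the
# position on the page, and a payload (here just a weight).
from collections import defaultdict

def build_text_index(pages):
    """pages: dict mapping page_id -> (title, body)."""
    index = defaultdict(list)
    for page_id, (title, body) in pages.items():
        for pos, word in enumerate(title.lower().split()):
            index[word].append((page_id, pos, {"weight": 3}))  # title hit
        for pos, word in enumerate(body.lower().split()):
            index[word].append((page_id, pos, {"weight": 1}))  # body hit
    for postings in index.values():
        postings.sort(key=lambda loc: loc[:2])   # sorted list of locations
    return index

idx = build_text_index({1: ("Ice Hockey", "scores and schedules"),
                        2: ("Equipment", "ice hockey skates")})
print(idx["hockey"])   # occurrences in both pages; the title hit weighs more
```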

Slide 19: Indexer
Structure index (or link index): built by viewing the web as a directed graph, with pages as nodes and hyperlinks as edges. For each page, the incoming and outgoing links are stored → neighborhood information. This information is also used by ranking algorithms.
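
A sketch of the structure index: adjacency sets in both directions, so a ranking algorithm can read a page's neighborhood cheaply.

```python
# Structure (link) index: the web as a directed graph, storing
# both outgoing and incoming links per page.
from collections import defaultdict

def build_link_index(links):
    """links: iterable of (src_page, dst_page) hyperlink pairs."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, dst in links:
        outgoing[src].add(dst)   # F(src): pages src points to
        incoming[dst].add(src)   # B(dst): pages pointing to dst
    return outgoing, incoming

out_links, in_links = build_link_index([(1, 2), (1, 3), (2, 3)])
print(in_links[3])   # {1, 2}: the neighborhood used by PageRank and HITS
```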

Slide 20: Indexer
In the text index, the index term plus the additional information is encoded into a bit string to save space; the same is done in the other indexes. An early version of Google used a 2-byte code. Challenges: how to handle an index covering a billion pages? How to deal with index rebuilds?
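
To illustrate such compact encoding, here is a sketch that packs one word occurrence into 2 bytes. The field layout (4 bits of weight, 12 bits of position) is invented for illustration and is not Google's actual format.

```python
# Illustrative 2-byte "hit" encoding: 4 bits of weight plus
# 12 bits of word position, packed into one 16-bit code.
def pack_hit(weight, position):
    assert 0 <= weight < 16 and 0 <= position < 4096
    return (weight << 12) | position          # fits in 16 bits

def unpack_hit(code):
    return code >> 12, code & 0x0FFF          # (weight, position)

code = pack_hit(weight=3, position=141)
print(code.to_bytes(2, "big"), unpack_hit(code))   # b'0\x8d' (3, 141)
```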

Slide 21: Ranking
Ranking is needed because:
–The web is vast…pages found containing the query words may be of poor quality and not relevant.
–Pages are not self-descriptive. E.g. the query words "search engine" do not yield the home pages of common search engines, because those pages do not contain the words "search engine".
–Spamming.

Slide 22: Ranking
The link structure of the web contains important information that can be used to filter or rank web pages. A link from page A to page B can be considered a recommendation of page B by the author of A → a page with many links pointing to it should get a higher ranking. Two algorithms are based on this: (i) PageRank and (ii) HITS.

Slide 23: Ranking: PageRank
The importance of a page is based on the "number of pages on the web pointing to it"; e.g. the Yahoo! homepage is more important than the KSU homepage. The idea was proposed by Lawrence Page and Sergey Brin, the creators of Google.

Slide 24: Ranking: PageRank
Therefore: Rank(page P) = number of other pages pointing to it. This is too simple: spamming is a problem, since any number of pages can be created artificially to point to a page. PageRank extends the basic idea by considering the importance of the pages pointing to a given page → a page is more important if Yahoo! points to it.

Slide 25: Ranking: PageRank
Simple PageRank: let the pages on the web be 1, 2, …, m.
N(i): number of outgoing links of page i.
B(i): set of pages pointing to page i.
PageRank of page i: r(i) = Σ_{j ∈ B(i)} r(j) / N(j), i.e. each page divides its rank evenly among the pages it points to.
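
A sketch of simple PageRank by iteration, directly following the formula above. It omits the damping factor used in practice, and assumes every page has at least one outgoing link and every link target is itself a key of the dict.

```python
# Simple PageRank: start from uniform ranks and repeatedly set
# r(i) = sum over j in B(i) of r(j) / N(j).
def simple_pagerank(out_links, iterations=50):
    """out_links: dict page -> list of pages it points to."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)   # r(j) / N(j)
            for i in targets:
                new_rank[i] += share         # j is a member of B(i)
        rank = new_rank
    return rank

print(simple_pagerank({1: [2, 3], 2: [3], 3: [1]}))
```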

Slide 26: Ranking: HITS
HITS: Hypertext Induced Topic Search. It does not assign a global rank to every page: the HITS algorithm is query dependent. It distinguishes authority pages and hub pages, assigning two scores to each page: an authority score and a hub score. The basic idea is to identify a small sub-graph of the web (depending on the user query) and apply link analysis to it to locate the authorities and hubs.

Slide 27: Ranking: HITS
The algorithm has two parts:
–Identifying the focused sub-graph.
–Performing link analysis on it.
The focused sub-graph is generated by forming a root set R, obtained from the text index:
R = {a random set of pages containing the given query string}
Focused set = R + pages in the neighbourhood of R.

Slide 28: Ranking: HITS
Algorithm HITS (focused sub-graph):
1. R ← the set of t pages that contain the query terms.
2. S ← R.
3. For each page p ∈ R:
   (a) include in S a maximum of d pages that p points to.
4. The graph induced on S is the focused sub-graph.

Slide 29: Ranking: HITS
This algorithm takes the query string, t, and d as input parameters:
–t limits the size of the root set.
–d limits the number of pages added to the sub-graph.
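
A sketch of the focused sub-graph construction as given on the slide. `text_index_lookup` is a hypothetical stand-in for a text-index query returning pages that contain the query terms. (Kleinberg's original expansion also pulls in pages pointing *into* the root set; the slide's simpler out-link-only version is used here.)

```python
# Build the HITS focused sub-graph: root set R from the text
# index, expanded by at most d out-links per root page.
def focused_subgraph(query_terms, t, d, out_links, text_index_lookup):
    root = list(text_index_lookup(query_terms))[:t]   # R: at most t pages
    s = set(root)                                     # S <- R
    for p in root:
        for q in list(out_links.get(p, []))[:d]:      # up to d out-links of p
            s.add(q)
    # The focused sub-graph is the graph induced on S.
    edges = {(u, v) for u in s for v in out_links.get(u, ()) if v in s}
    return s, edges
```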

Slide 30: Ranking: HITS
Link analysis identifies the hubs and authorities in the expanded set S. Let the pages in the focused sub-graph S be 1, 2, …, n.
B(i): the set of pages that point to page i.
F(i): the set of pages that page i points to.
The algorithm produces scores a_i and h_i for each page in S, starting from arbitrary initial values of a_i and h_i.

Slide 31: Ranking: HITS
Each iteration performs two steps, I and O:
I step: a_i = Σ_{j ∈ B(i)} h_j
O step: h_i = Σ_{j ∈ F(i)} a_j
The scores are then normalized so that Σ_i (a_i)² = 1 and Σ_i (h_i)² = 1. The algorithm repeats until the values of a_i and h_i converge.
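
A compact sketch of this iteration, assuming the B and F neighborhoods come from a link index like the one sketched earlier:

```python
# HITS I/O iteration over the focused sub-graph. in_links[p] is
# B(p), out_links[p] is F(p); both are restricted to pages in S.
from math import sqrt

def hits(in_links, out_links, pages, iterations=50):
    pages = list(pages)
    auth = {p: 1.0 for p in pages}   # a_i, arbitrary initial values
    hub = {p: 1.0 for p in pages}    # h_i
    for _ in range(iterations):
        # I step: a_i = sum of h_j over pages j in B(i).
        auth = {p: sum(hub[j] for j in in_links.get(p, ()) if j in hub)
                for p in pages}
        # O step: h_i = sum of a_j over pages j in F(i).
        hub = {p: sum(auth[j] for j in out_links.get(p, ()) if j in auth)
               for p in pages}
        # Normalize so the squared scores sum to 1.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```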

