1
Search Engine 101 Qu, Miao Nov. 2003
2
Agenda: Definition and Types; Architecture; Robot Overview; How Search Engines Work; Problems of Current Search Engines; An Example: Google; The Future of Search Engines; Search Engines vs. Directories; References
3
What Is a Search Engine? Search engines are tools that use computer programs called spiders and robots to gather information automatically. They can create specific databases according to the user's query. Source: “Authority Guide to Evaluating Information on the Internet”, Alison Cooke, 1999
4
The Types of Search Engines
Individual search engines compile their own searchable databases of the web. Example: Google. Meta search engines do not compile databases. Instead, they search the databases of several individual engines simultaneously. Examples: Metacrawler, Vivisimo. Source:
5
Web Search Engine Layers
From description of the FAST search engine, by Knut Risvik
6
Standard Web Search Engine Architecture
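[Figure: architecture diagram.] Indexing path: crawl the web; check for duplicates and store the documents (DocIds); create an inverted index. Query path: the user query goes to the search engine servers, which consult the inverted index and show results to the user. Source: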
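The indexing and query paths in the diagram can be illustrated with a small sketch. It is not the architecture of any particular engine: the three documents, the DocIds, and the function names `build_inverted_index` and `search` are made up for illustration, and duplicate detection is left out.

```python
from collections import defaultdict

# Hypothetical miniature "crawled" collection: DocId -> page text.
DOCUMENTS = {
    1: "search engines crawl the web",
    2: "a robot crawls the web and stores documents",
    3: "search engines store an inverted index",
}


def build_inverted_index(documents):
    """Map each word to the sorted list of DocIds that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the DocIds that contain every query word (a simple AND query)."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


INDEX = build_inverted_index(DOCUMENTS)
print(search(INDEX, "search engines"))
print(search(INDEX, "the web"))
```

The point of the inverted index is that a query only touches the posting lists of its own words instead of scanning every stored document.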
7
Anatomy of Search Engine
Working procedures: (1) crawling the web (robot); (2) establishing the database (robot); (3) query (searcher); (4) searching the database (search engine software); (5) ranking (search engine software); (6) interface with the client. Components: the robot; the index (the catalog, or database, of what the spider finds); and the search engine software (the program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of what it believes is most relevant). Source:
8
Robot Overview. The robot is an essential ingredient of all current web search tools. A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents that it references. It can be written in Perl, Java, C, C++ or other languages (e.g. Tcl/Tk). Also known as: spiders, wanderers, worms, crawlers, gatherers, intelligent agents. It can have other functions: measuring the size and scope of the web; maintaining a database of web pages by checking old links for updates and relocation; mirroring sites; address harvesting; etc. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13.)
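A minimal sketch of "retrieve a document, then recursively retrieve everything it references". To stay self-contained it reads pages from a hypothetical in-memory dictionary (`FAKE_WEB`, with made-up `example.test` URLs) rather than fetching them over HTTP; `LinkExtractor` and `crawl` are illustrative names, not code from any of the robots mentioned in these slides.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML source. A real robot would
# fetch these documents over HTTP instead of reading a dict.
FAKE_WEB = {
    "http://example.test/": (
        '<a href="http://example.test/a">A</a> '
        '<a href="http://example.test/b">B</a>'
    ),
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, web, visited=None):
    """Retrieve a document, then recursively retrieve what it references."""
    if visited is None:
        visited = []
    # Skip pages already retrieved (avoids loops) and unknown URLs.
    if url in visited or url not in web:
        return visited
    visited.append(url)
    parser = LinkExtractor()
    parser.feed(web[url])
    for link in parser.links:
        crawl(link, web, visited)
    return visited


print(crawl("http://example.test/", FAKE_WEB))
```

The `visited` list is what keeps the recursion from looping forever when pages link back to each other, as the home page and `/b` do here.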
9
Robot Overview (cont.) Interested in getting source code?
Harvest (Perl); Combine (Perl); Acme.Spider (Java)
10
Establishing Database
“It is important to remember that when you are using a search engine, you are NOT searching the entire web as it exists at this moment. You are actually searching a portion of the web, captured in a fixed index created at an earlier date.” Source:
11
How Robot Searches Web Pages
The robot does not itself wander around the web. It uses HTTP to request documents from servers. “In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.” Strategies: depth-first, which creates a relatively comprehensive database on a few subjects; breadth-first, which creates databases touching more lightly on a wider variety of documents. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13.)
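The two strategies differ only in which pending URL the robot takes next. The sketch below shows this on a hypothetical link graph (`LINKS` and the page names are invented); `crawl_order` is an illustrative helper, not part of any real crawler.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["news", "sports"],
    "news": ["news/1", "news/2"],
    "sports": ["sports/1"],
    "news/1": [],
    "news/2": [],
    "sports/1": [],
}


def crawl_order(start, links, strategy="breadth"):
    """Return the order in which pages are visited under a strategy."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Breadth-first takes the oldest pending URL; depth-first the newest.
        page = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        neighbours = links.get(page, [])
        if strategy == "depth":
            # Reverse so the first link on the page is explored first.
            neighbours = reversed(neighbours)
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


print(crawl_order("home", LINKS, "breadth"))
print(crawl_order("home", LINKS, "depth"))
```

Breadth-first reaches both top-level sections before any article, while depth-first finishes the whole "news" branch before touching "sports", which matches the trade-off described above.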
12
What If I Don’t Want to Be Indexed by a Robot?
robots.txt: a plain-text document that robots check. An example; robots.txt implements the Robots Exclusion Protocol, which allows the web site administrator to define which parts of the site are off-limits to specific robot user-agent names. Robots META tag: sample entries: <META name="ROBOTS" content="NOINDEX"> <META name="ROBOTS" content="NOFOLLOW"> Many, but not all, search engine robots will recognize this tag and follow the rules for each page.
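As an illustration of how a well-behaved robot might consult robots.txt, the sketch below uses Python's standard `urllib.robotparser` module. The file contents, the robot names "BadBot" and "GoodBot", and the `example.test` URLs are all invented for the example; the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and the robot name
# "BadBot" are made up for this sketch.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of an already-downloaded robots.txt file.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the named robot may retrieve the URL."""
    return parser.can_fetch(agent, url)


print(allowed("GoodBot", "http://example.test/index.html"))
print(allowed("GoodBot", "http://example.test/private/notes.html"))
print(allowed("BadBot", "http://example.test/index.html"))
```

Compliance is voluntary: the file only states the administrator's wishes, and it is up to each robot to run a check like this before requesting a page.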
13
Ranking I. Once a search engine has used your search terms to gather "hits" from its database, it lists or "ranks" the resulting sites in order of its own estimate of their relevance. In most cases, the ranking rule is relevance prediction. Currently, search engines predict relevance from two sets of factors: those based on a site's content, and those external to the site.
14
Ranking II Factors based on a web site's content
Word frequency (How often search terms occur in a page in relationship to other text) Location of search terms in the document (Are they in the title? Are they near the top of the page?) Relational clustering (How many pages in the site contain the search terms?) The site's design (Does it use frames? How fast does it load?)
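The first two content factors (word frequency and location of the search terms) can be combined into a toy score. This is only a sketch: the weights `TITLE_BONUS` and `EARLY_BONUS`, the cut-off `EARLY_WORDS`, and the function `content_score` are arbitrary assumptions, not values used by any real search engine.

```python
# Hypothetical weights chosen only for illustration.
TITLE_BONUS = 1.0   # term appears in the page title
EARLY_BONUS = 0.5   # term appears near the top of the page
EARLY_WORDS = 10    # how many leading words count as "near the top"


def content_score(title, body, query):
    """Score a page from word frequency and term location."""
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    if not terms or not body_words:
        return 0.0
    score = 0.0
    for term in terms:
        # Word frequency relative to the other text on the page.
        score += body_words.count(term) / len(body_words)
        # Location: is the term in the title?
        if term in title_words:
            score += TITLE_BONUS
        # Location: is the term near the top of the page?
        if term in body_words[:EARLY_WORDS]:
            score += EARLY_BONUS
    return score


page_a = ("Search engine basics", "a search engine ranks pages")
page_b = ("Cooking tips", "some cooks search for recipes online every day")
print(content_score(*page_a, "search"))
print(content_score(*page_b, "search"))
```

With these weights the page that carries the term in its title outranks the page that merely mentions it once in the body.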
15
Ranking III Factors external to the site
Link popularity -- Sites with more links pointing to them are prioritized Click popularity -- Sites visited more often are prioritized "Sector" popularity -- Sites visited by certain demographic or social groups are prioritized (Note: This system requires user-provided information) Business alliances among services -- Results from a partner search service are ranked higher Pay-for-placement rankings -- Site owners pay for high rankings
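Link popularity, the first external factor, amounts to counting inbound links. A minimal sketch follows, using an invented link graph (`SITE_LINKS`) and hypothetical helper names; real engines weight links rather than merely counting them.

```python
from collections import Counter

# Hypothetical link graph: site -> sites it links to.
SITE_LINKS = {
    "a": ["c"],
    "b": ["c", "a"],
    "c": ["a"],
    "d": ["c"],
}


def link_popularity(links):
    """Count how many distinct other sites link to each site."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


def rank_by_links(links):
    """Order sites by inbound-link count, ties broken alphabetically."""
    counts = link_popularity(links)
    sites = set(links) | set(counts)
    return sorted(sites, key=lambda site: (-counts[site], site))


print(rank_by_links(SITE_LINKS))
```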
16
How Do Search Engines Differ from One Another?
The robot and the database: How is the database cleaned up and filtered? The frequency with which sites are spidered affects the database's freshness. The formula (algorithms): Different search engines employ different retrieval formulas, or algorithms, to provide relevant content in response to a user's query. Features and functionality: The various search engines have different bells and whistles to appeal to searchers' different experience levels or individual tastes. The look: Engines' graphical user interfaces vary, as does the format in which they present their results.
17
What Are the Problems of Current Search Engines?
“The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users’ horizons, they are often frustrating and consume precious time.” Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page
18
Challenges in Search Engines
Search Engine Spam; Quality of Content; Quality Evaluation; Web Conventions; Avoid Duplicate Search (Host); Vaguely Structured Data. Source: Monika R. Henzinger, Rajeev Motwani and Craig Silverstein, “Challenges in Web Search Engines”, ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002
19
An Example: Google
20
Some Features
What's not indexed: registration pages, text in graphics and multimedia files (use Alt tags), XML, Java applets, comment tags, Acrobat files, spammers.
Content and location: keywords should be close to each other; content should include keywords in text or links.
HTML title: seems to be a factor.
Meta tags: no.
Link popularity: very important, especially from relevant pages.
21
Architecture. Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page
22
Main Technology Applied in Google:
PageRank™: a system for ranking web pages. Anchor text. Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page
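To give a feel for the idea behind PageRank, here is a small power-iteration sketch. It is an assumption-laden illustration, not Google's implementation: it uses a commonly taught normalized formulation (ranks sum to 1), a damping factor of 0.85, a fixed 50 iterations, an even redistribution of rank from pages without outgoing links, and an invented four-page graph (`WEB_GRAPH`).

```python
# Hypothetical link graph: page -> pages it links to.
WEB_GRAPH = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for each page from its inbound links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank(WEB_GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Unlike a plain inbound-link count, a link here is worth more when it comes from a page that is itself highly ranked, and less when that page links to many others.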
23
The Future of Search Engines
Theme search engines: the best match between the page content and the evaluation of the page. More straightforward answers. More customization.
24
Directories vs. Search Engines
Directories: hand-selected sites; search over the contents of the descriptions of the pages; organized in advance into categories. Search engines: all pages in all sites; search over the contents of the pages themselves; organized after the query by relevance rankings or other scores. Source:
25
References: Alison Cooke, “Authority Guide to Evaluating Information on the Internet”, 1999. Danny Sullivan, “How Search Engines Work”, 2002. Knut Risvik, “From description of the FAST search engine”, ? Berkeley University, 2002. Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13.
26
Reference http://www.searchengines.com/urlkeywords.html
Danny Sullivan, “Major Search Engines and Directories”, 2003. Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Monika R. Henzinger, Rajeev Motwani and Craig Silverstein, “Challenges in Web Search Engines”, SIGIR Forum, Vol. 36, No. 2, Fall 2002. Robin Nobles, “The Future Of Search Engine Optimizing”, 2003. Gary H. Anthes, “The Future of the Search Engine”, 2002.