Search Engine 101. Qu, Miao, Nov. 2003.



Agenda: Definition and Types; Architecture; Robot Overview; How Search Engines Work; Problems of Current Search Engines; An Example: Google; The Future of Search Engines; Search Engines vs. Directories; References

What Is a Search Engine? Search engines are tools that use computer programs called spiders or robots to gather information automatically. They can create specific databases according to the query of the user. Source: Alison Cooke, “Authoritative Guide to Evaluating Information on the Internet”, 1999

The Types of Search Engines. Individual search engines compile their own searchable databases on the web (e.g. Google). Meta search engines do not compile databases. Instead, they search the databases of multiple individual engines simultaneously (e.g. Metacrawler, Vivisimo). Source: http://www.sc.edu/beaufort/library/lesson1.html
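The meta search idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `engine_a` and `engine_b` are hypothetical stand-ins for real engines, the engines are called one after another rather than simultaneously, and the merge rule (summing reciprocal ranks) is one simple choice, not the method used by any engine named in the slides.

```python
from collections import defaultdict

def meta_search(query, engines):
    """Merge ranked URL lists from several engines into one ranked list.

    Each engine is a callable returning URLs in rank order. A URL earns
    1/rank from every engine that returns it (an assumed merge rule).
    """
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical engines returning fixed results, for illustration only.
def engine_a(query):
    return ["http://a.example/1", "http://b.example/2"]

def engine_b(query):
    return ["http://b.example/2", "http://c.example/3"]

merged = meta_search("search engines", [engine_a, engine_b])
print(merged)
```

A page returned by both engines rises to the top, which is the main benefit a meta search engine offers over a single individual engine.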

Web Search Engine Layers. From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Standard Web Search Engine Architecture (diagram): crawl the web; check for duplicates and store the documents (DocIds); create an inverted index; search engine servers answer the user query from the inverted index; show results to the user. Source: http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt
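A minimal sketch of the indexing and query stages of this architecture follows. The class name `TinyIndex`, the hash-based duplicate check, and the "every query term must appear" matching rule are assumptions made for illustration. The diagram does not say how duplicates are detected or how queries are matched.

```python
import hashlib
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    def __init__(self):
        self.urls = {}                    # DocId -> URL
        self.seen_hashes = set()          # content hashes of stored documents
        self.postings = defaultdict(set)  # term -> set of DocIds

    def add(self, url, text):
        """Store a document unless an identical copy is already indexed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return None                   # exact duplicate: skip it
        self.seen_hashes.add(digest)
        doc_id = len(self.urls)
        self.urls[doc_id] = url
        for term in tokenize(text):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return URLs of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.urls[d] for d in sorted(hits)]

index = TinyIndex()
index.add("http://example.com/a", "Search engines crawl the web")
index.add("http://example.com/b", "An inverted index maps terms to documents")
index.add("http://example.com/a-copy", "Search engines crawl the web")
print(index.search("crawl web"))
```

The inverted index is what makes the query cheap: the engine looks up each term's posting set instead of scanning every stored document.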

Anatomy of a Search Engine. Working procedure: (1) crawl the web (robot); (2) establish the database (robot); (3) query (searcher); (4) search the database (search engine software); (5) ranking (search engine software); (6) interface with the client. Components: the robot; the index, a catalog or database of what the spider finds; and the search engine software, a program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. Source: http://searchenginewatch.com/webmasters/article.php/2168031

Robot Overview. A robot is an essential ingredient of all current web search tools. It is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. It can be written in Perl, Java, C, C++ or other languages (e.g. Tcl/Tk). Also known as: spiders, wanderers, worms, crawlers, gatherers, intelligent agents. Other possible functions: measuring the size and scope of the web; maintaining a database of web pages by checking old links for updates and relocation; mirroring sites; email address harvesting; etc. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13; http://www.robotstxt.org/wc/faq.html#what)

Robot Overview (cont.). Interested in getting source code? http://webharvest.sourceforge.net/ng/ (Harvest, Perl); http://www.lub.lu.se/combine/ (Combine, Perl); http://www.acme.com/java/software/Acme.Spider.html (Acme.Spider, Java)

Establishing Database “It is important to remember that when you are using a search engine, you are NOT searching the entire web as it exists at this moment. You are actually searching a portion of the web, captured in a fixed index created at an earlier date.” Source: http://www.sc.edu/beaufort/library/lesson1.html

How a Robot Searches Web Pages. A robot does not wander the web itself. It uses HTTP to request documents from servers. “In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.” Strategy: depth-first creates a relatively comprehensive database on a few subjects; breadth-first creates databases touching more lightly on a wider variety of documents. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13; http://www.robotstxt.org/wc/faq.html#what)
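The two strategies differ only in which pending URL the robot takes next. The sketch below assumes a tiny in-memory link table (`WEB`) in place of real HTTP requests, and the function and parameter names are illustrative. A real robot would fetch each page over HTTP and extract its links.

```python
from collections import deque

def crawl(seeds, get_links, strategy="breadth", limit=100):
    """Visit pages reachable from the seed URLs, returning the visit order.

    strategy="breadth" takes the oldest pending URL (a queue);
    strategy="depth" takes the newest pending URL (a stack).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:      # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical site structure used instead of real HTTP requests.
WEB = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
}

def get_links(url):
    return WEB.get(url, [])

bfs_order = crawl(["/"], get_links, strategy="breadth")
dfs_order = crawl(["/"], get_links, strategy="depth")
print(bfs_order)
print(dfs_order)
```

Breadth-first reaches every top-level section before going deeper, while depth-first exhausts one section before starting the next, which matches the trade-off described in the slide.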

If I Don’t Want to Be Indexed by a Robot? Robots.txt: a plain text document that robots check. Robots.txt implements the Robots Exclusion Protocol, which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Robots META tag, sample entries: <META name="ROBOTS" content="NOINDEX"> <META name="ROBOTS" content="NOFOLLOW">. Many, but not all, search engine robots will recognize this tag and follow the rules for each page. Source: http://www.searchtools.com/robots/
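As an illustration of how a well-behaved robot might apply these rules, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user agent names `ExampleBot` and `OtherBot`, and the example.com URLs are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one rule set for a named robot, one for all others.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network access

checks = [
    ("ExampleBot", "http://www.example.com/private/report.html"),
    ("ExampleBot", "http://www.example.com/index.html"),
    ("OtherBot", "http://www.example.com/tmp/scratch.html"),
    ("OtherBot", "http://www.example.com/private/report.html"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

A robot that has its own named section follows only that section, so in this example `ExampleBot` is kept out of `/private/` while other robots are only kept out of `/tmp/`. Compliance is voluntary: the file tells robots what the administrator wants but does not enforce it.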

Ranking I. Once a search engine has used your search terms to gather "hits" from its database, it lists or "ranks" the resulting sites in order of its own estimation of their relevance. In most cases, the ranking rule is relevance prediction. Currently, search engines predict relevance based on two sets of factors: those based on a site's content and those external to the site. Source: http://www.searchengines.com/searchBasics1.html

Ranking II. Factors based on a web site's content: word frequency (how often do search terms occur in a page in relation to other text?); location of search terms in the document (are they in the title? are they near the top of the page?); relational clustering (how many pages in the site contain the search terms?); the site's design (does it use frames? how fast does it load?). Source: http://www.searchengines.com/searchBasics1.html
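Two of the content factors above, word frequency and term location, can be combined into a toy score. The formula and the `title_boost` weight are assumptions chosen for illustration. Real engines do not publish their formulas.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def content_score(query, title, body, title_boost=2.0):
    """Score a page from word frequency plus a bonus for title matches.

    frequency  = share of body words that are query terms
    title part = share of query terms found in the title, times title_boost
    """
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    body_words = tokenize(body)
    matches = sum(1 for word in body_words if word in terms)
    frequency = matches / len(body_words) if body_words else 0.0
    title_hits = len(terms & set(tokenize(title)))
    return frequency + title_boost * title_hits / len(terms)

score_a = content_score("search engine", "Search engine basics",
                        "A search engine indexes pages")
score_b = content_score("search engine", "Cooking tips",
                        "Use a search engine to find recipes")
print(score_a, score_b)
```

Both pages mention the query terms, but the first also has them in its title and at a higher density, so it scores higher.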

Ranking III. Factors external to the site: link popularity (sites with more links pointing to them are prioritized); click popularity (sites visited more often are prioritized); "sector" popularity (sites visited by certain demographic or social groups are prioritized; note that this system requires user-provided information); business alliances among services (results from a partner search service are ranked higher); pay-for-placement rankings (site owners pay for high rankings). Source: http://www.searchengines.com/searchBasics1.html
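Link popularity in its simplest form is just a count of inbound links. The sketch below assumes links are given as (source, target) pairs between hypothetical sites. Ignoring repeated links and self-links is a design choice made here, not something the slide specifies.

```python
from collections import Counter

def link_popularity(links):
    """Count distinct inbound links per site from (source, target) pairs."""
    inbound = Counter()
    for source, target in set(links):   # repeated links count once
        if source != target:            # a site cannot vote for itself
            inbound[target] += 1
    return inbound

# Hypothetical link data between three sites.
LINKS = [("a", "c"), ("b", "c"), ("c", "a"), ("b", "c"), ("c", "c")]

popularity = link_popularity(LINKS)
ranked = sorted(popularity, key=lambda site: (-popularity[site], site))
print(ranked)
```

Site `c` comes first because two different sites link to it, while `a` has a single inbound link and `b` has none.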

How Do Search Engines Differ from One Another? The robot. The database: how is the database cleaned up and filtered? The frequency with which sites are spidered affects the database's freshness. The formula (algorithms): different search engines employ different search retrieval formulas, or algorithms, to provide relevant content in response to a user's query. Features and functionality: the various search engines have different bells and whistles to appeal to searchers' different experience levels or individual tastes. The look: engines' graphical user interfaces vary, as does the format in which they present their results. Source: http://www.searchengines.com/searchDiffer1.html

What Are the Problems of Current Search Engines? “The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users’ horizons, they are often frustrating and consume precious time.” Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page

Challenges in Search Engines: Search Engine Spam; Quality of Content; Quality Evaluation; Web Conventions; Avoid Duplicate Search (Host); Vaguely Structured Data. Source: Monika R. Henzinger, Rajeev Motwani, Craig Silverstein, “Challenges in Web Search Engines”, ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002

An Example: Google http://www.bu.edu/mfeldman/Google/

Some Features. What's not indexed: registration pages, text in graphics and multimedia files (use Alt tags), XML, Java applets, comment tags, Acrobat files, spammers. Content and location: keywords should be close to each other; content should include keywords in text or links. HTML title: seems to be a factor. Meta tags: no. Link popularity: very important, especially from relevant pages.

Architecture (diagram). Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page

Main Technology Applied in Google: PageRank™, a system for ranking web pages; anchor text. Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page
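The cited paper defines PageRank as PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor that the authors usually set to 0.85. The sketch below iterates that formula on a made-up three-page graph. The fixed iteration count and the graph representation are assumptions, and the code is not Google's implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    graph maps each page to the set of pages it links to. Pages without
    outgoing links pass no rank on in this simple sketch.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in graph.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
GRAPH = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}

ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page `c` ends up highest because it receives rank from both `a` and `b`. Unlike a plain inbound-link count, a link from a highly ranked page is worth more than one from a low-ranked page.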

The Future of Search Engines. Theme search engines: the best match between the page content and the evaluation of the page. More straightforward answers. More customization.

Directories vs. Search Engines. Directories: hand-selected sites; search over the contents of the descriptions of the pages; organized in advance into categories. Search engines: all pages in all sites; search over the contents of the pages themselves; organized after the query by relevance rankings or other scores. Source: http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt

Reference
Alison Cooke, “Authoritative Guide to Evaluating Information on the Internet”, 1999.
http://www.sc.edu/beaufort/library/lesson1.html
Danny Sullivan, “How Search Engines Work”, 2002, http://searchenginewatch.com/webmasters/article.php/2168031
Knut Risvik, “From description of the FAST search engine”, ?, http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Berkeley University, 2002, http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt
http://www.robotstxt.org/wc/faq.html#what
Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, p. 13.
http://www.searchtools.com/robots/
http://www.searchengines.com/searchBasics1.html
http://www.searchengines.com/textkeywords.htm

Reference
http://www.searchengines.com/urlkeywords.html
http://www.searchengines.com/ranking_factors.html
http://www.searchengines.com/searchDiffer1.html
www.google.com
Danny Sullivan, “Major Search Engines and Directories”, 2003, http://searchenginewatch.com/links/article.php/2156221
www.alltheweb.com
http://www.searchengines.com/partnerships.html
Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, http://google.stanford.edu/~backrub/google.html
Monika R. Henzinger, Rajeev Motwani, Craig Silverstein, “Challenges in Web Search Engines”, SIGIR Forum, Vol. 36, No. 2, Fall 2002.
Robin Nobles, “The Future Of Search Engine Optimizing”, 2003, http://www.searchengineworkshops.com/articles/se-optimization-future.html
Gary H. Anthes, “The Future of the Search Engine”, 2002, http://www.computerworld.com/databasetopics/data/story/0,10801,70037,00.html