Search Engine Survey Hongfei Yan 2/15/2007

Presentation transcript:

Search Engine Survey Hongfei Yan 2/15/2007

2 Outline  Background Information  Definition, history, how search engines work  General Search Engines  Interface, databases, features  Google, Yahoo!, Baidu, Live  Open Source Search Engines  Lucene, SWISH-E  Metasearch, Visual, and Answer Search Engines

3 Definition of Search Engine  A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the Web, inside a corporate or proprietary network, or in a personal computer.  The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria.  This list is often sorted with respect to some measure of relevance of the results.  Search engines use regularly updated indexes to operate quickly and efficiently.  The term "search engine" usually refers to a Web search engine, which searches for information on the public Web.

4 Timeline of Search Engines “Full text” crawler-based Link popularity and PageRank

5 How search engines work  Web crawling  A crawler is an automated Web browser that follows every link it sees. Pages can be excluded through robots.txt.  Indexing  The contents of each page are analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags).  Searching  When a user comes to the search engine and makes a query, the engine looks up the index and provides a listing of the best-matching web pages according to its criteria.
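
The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not how any production engine is built: the in-memory `PAGES` dictionary, the example.com URLs, and the "count of matching query terms" ranking are all assumptions made for the example. Only `urllib.robotparser` is a real standard-library module.

```python
import re
from collections import Counter, defaultdict, deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": ("Search engines crawl the Web",
                            ["http://example.com/a", "http://example.com/private/x"]),
    "http://example.com/a": ("Engines index words from each page",
                             ["http://example.com/"]),
    "http://example.com/private/x": ("Secret engines page", []),
}
# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


def crawl(seed):
    """Web crawling: follow every link, skipping URLs excluded by robots.txt."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url not in PAGES or not robots.can_fetch("*", url):
            continue
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Indexing: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Searching: rank pages by how many query terms they contain (assumed criterion)."""
    scores = Counter()
    for term in re.findall(r"\w+", query.lower()):
        for url in index.get(term, ()):
            scores[url] += 1
    return sorted(scores, key=lambda url: (-scores[url], url))


fetched = crawl("http://example.com/")
index = build_index(fetched)
print(search(index, "engines index"))
```

The excluded `/private/x` page is never fetched, so its words never reach the index.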

6 Storage costs and crawling time  Storage costs are not the limiting resource in search engine implementation.  Simply storing 10 billion pages of 10 kbytes each (compressed) requires 100TB, plus another 100TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs each with four 500GB disk drives.  However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability.  Also, the costs of operating a large server farm are not trivial.  Crawling 10B pages with 100 machines, each crawling at 100 pages/second, would take 1M seconds, or 11.6 days, on a very high capacity Internet connection.
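
The figures on this slide can be checked with a short calculation. The sketch below uses only the numbers given above and assumes decimal units (1 TB = 10^12 bytes, 1 kbyte = 1,000 bytes) and that each of the 100 machines crawls at 100 pages/second.

```python
PAGES = 10_000_000_000        # 10 billion pages
PAGE_BYTES = 10_000           # 10 kbytes per compressed page
TB = 10**12                   # decimal terabyte (assumption)

# Storage: pages plus roughly the same again for indexes.
page_storage_tb = PAGES * PAGE_BYTES / TB
total_storage_tb = page_storage_tb + 100

# Capacity: 100 PCs, each with four 500 GB drives.
capacity_tb = 100 * 4 * 500 * 10**9 / TB

# Crawl time: 100 machines, each fetching 100 pages/second.
pages_per_second = 100 * 100
crawl_seconds = PAGES / pages_per_second
crawl_days = crawl_seconds / 86_400

print(f"page storage: {page_storage_tb:.0f} TB, total: {total_storage_tb:.0f} TB")
print(f"disk capacity: {capacity_tb:.0f} TB")
print(f"crawl time: {crawl_seconds:.0f} s = {crawl_days:.1f} days")
```

The 200TB of disk exactly covers the 200TB of pages and indexes, and a budget of under $200k across 100 machines implies under $2,000 per PC.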

7 Outline  Background Information  Definition, history, how search engines work  General Search Engines  Interface, databases, features  Google, Yahoo!, Baidu, Live  Open Source Search Engines  Lucene, SWISH-E  Metasearch, Visual, and Answer Search Engines

8 General Search Engine  Primary Search Engines  These are well-known and well-used, and they can potentially generate a great deal of traffic. * Google * Yahoo! * Baidu * Live  Secondary Web Search Engines  These are either smaller or not the primary search engine for access to databases from the Providers of Search listed below. * Exalead * Gigablast * WiseNut  Dead Search Engines  These search engines used to offer their own database or unique search features. They have all abandoned their position in search, although they may still have some kind of search functionality. * AlltheWeb * AltaVista * Excite * Infoseek * Inktomi

9 GSE: Minimalist User Interface

10 GSE: Databases  Web:  Indexed Web pages (also includes URLs that it has not fully indexed).  Additional file types in the Web database include PDF, .ps, .doc, .xls, .txt, .ppt, .rtf, .asp and more.  Ads: Paid advertisements, usually shown on the right side (or top) under a "Sponsored Links" heading.

11 GSE: Google Database Components
Component | In millions | Percent
Indexed Web Pages | 1,… | …%
Unindexed URLs | 500 | 25%
Other file types | … | …%
Daily Reindexed Web Pages | 3 | 0.15%

12 GSE: Features  A large, unique search engine database  Includes cached copies of pages  Uses not only PageRank but more than 150 criteria to determine relevancy  Default Operation: Multiple search terms are processed as an AND operation by default. Phrase matches are ranked higher (Proximity Searching).  No truncation is available.  Case Sensitivity: None. Using either lower or upper case returns the same hits.
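
The default AND operation, the boost for phrase matches, and the case-insensitive matching described above can be illustrated with a small sketch. The `DOCS` collection and the two-level ranking (phrase matches first, then everything else) are assumptions for the example; the real ranking uses many more criteria.

```python
import re

# Hypothetical document collection: id -> text.
DOCS = {
    1: "Open source search engine written in Java",
    2: "The engine of the car; search for the source of the noise",
    3: "SEARCH ENGINE survey",
}


def tokens(text):
    """Lowercase and split into words, so case never affects matching."""
    return re.findall(r"\w+", text.lower())


def and_search(query, docs=DOCS):
    """Return ids of docs containing ALL query terms, phrase matches first."""
    terms = tokens(query)
    n = len(terms)
    results = []
    for doc_id, text in docs.items():
        words = tokens(text)
        # AND semantics: every term must be present.
        if not terms or not all(term in words for term in terms):
            continue
        # Phrase match: the terms appear adjacent and in query order.
        is_phrase = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
        results.append((0 if is_phrase else 1, doc_id))
    return [doc_id for _, doc_id in sorted(results)]


print(and_search("search engine"))   # same hits as "SEARCH ENGINE"
```

Document 2 contains both words but not as a phrase, so it ranks below documents 1 and 3.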

13 GSE: Features contd.  Field searching  Language Limits: Default is all languages. 30+ language limits are available.  Stop Words: searches almost all words except for operators like AND.  Display: The display includes the title, URL, a brief extract showing text near the search terms, the file size, and for many hits, a link to a cached copy of the page.

18 Review of Google  In Feb Google moved from Alpha test version to Beta and officially launched Sept. 21,  Since that time it has made its mark with its relevance ranking based on link analysis, cached pages, and aggressive growth.  Since its beta release, it has had phrase searching and the - for NOT, but it did not add an OR operation until Oct  In Dec. 2000, it added title searching.  In June 2000 it announced a database of over 560 million pages, which grew to over 600 million by the end of 2000 and then 1.5 billion in Dec  The 2+ billion reported on their home page as of April 2002 includes indexed pages, unindexed URLs, and other file formats. By Nov. 2002, they moved their claim up to 3 billion, and in Feb it went to 4 billion.  While no official claim is given, 20+ billion is one current estimate.

19 Review of Yahoo!  The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favourite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories... and the core concept behind Yahoo! was born.  In 2002, Yahoo! acquired Inktomi, and in 2003 it acquired Overture, which owned AlltheWeb and AltaVista.  In 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.

20 Review of Live  Live Search is the successor to MSN Search. This is the Microsoft Web search engine. Launched in September 2006, it uses its own, unique database.  In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot).  In early 2005 it started showing its own results live. At the same time, Microsoft ceased using results from Inktomi, now owned by Yahoo!.  In 2006, Microsoft migrated to a new search platform - Windows Live Search, retiring the "MSN Search" name in the process.

21 Review of Baidu  Baidu (Chinese: 百度 ; pinyin: bǎi dù) is a popular Chinese search engine which launched in 2000 and can search text and images. As of January 2007 it is fourth in Alexa's internet rankings, a position it has held since at least May 2006, with a market share of 52 percent.  Baidu provides an index of over 1 billion web pages.

22 Outline  Background Information  Definition, history, how search engines work  General Search Engines  Interface, databases, features  Google, Yahoo!, Baidu, Live  Open Source Search Engines  Lucene, SWISH-E  Metasearch, Visual, and Answer Search Engines

23 Lucene, lucene.apache.org  Lucene is a free and open source information retrieval API, originally implemented in Java by Doug Cutting. Lucene has been ported to programming languages including Perl, C#, C++, Python, Ruby and PHP.  It is suitable for any application which requires full text indexing and searching capability.  At the core of Lucene's logical architecture is the notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, and many other formats can be indexed so long as the textual information can be extracted.
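
The idea of a document as a set of named text fields can be sketched independently of any file format. The Python below is an illustration of the concept only and is not Lucene's API: the `FieldedIndex` class, its method names, and the sample documents are hypothetical, and the text is assumed to have already been extracted from the PDF or HTML source.

```python
import re
from collections import defaultdict


class FieldedIndex:
    """Toy index over documents made of named text fields (not Lucene's API)."""

    def __init__(self):
        # (field name, term) -> set of document ids containing that term in that field
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index a document given as {field name: extracted text}."""
        for field, text in fields.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return ids of documents whose given field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


idx = FieldedIndex()
# The indexer only sees extracted text, so the original format does not matter.
idx.add("report.pdf", {"title": "Search Engine Survey", "body": "Text extracted from a PDF file"})
idx.add("page.html", {"title": "Home", "body": "A survey of search engines"})

print(idx.search("title", "survey"))
print(idx.search("body", "survey"))
```

Keeping postings per field is also what makes field searching, such as a title-only query, possible.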

24 SWISH-E, swish-e.org  Swish-e stands for Simple Web Indexing System for Humans - Enhanced. It is used to index collections of documents ranging up to one million documents in size and includes import filters for many document types.  Many sites use Swish-e

25 Outline  Background Information  Definition, history, how search engines work  General Search Engines  Interface, databases, features  Google, Yahoo!, Baidu, Live  Open Source Search Engines  Lucene, SWISH-E  Metasearch, Visual, and Answer Search Engines

26 Visual Search Engine  A search returns both a list of search results and a tag cloud. The tag cloud contains the original search terms surrounded by related tags. The closer a suggested keyword is to the search terms and the larger it appears (in font size and boldness), the more relevant it is deemed. Holding the mouse over a term displays a new set of results in the bottom window and shows another keyword cloud overlaying the original.

27 VSE: Quintura.com

28 Metasearch Engines  Unlike search engines, metacrawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page.
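
One simple way to blend results from several engines onto one page is round-robin interleaving with duplicate removal. The slide does not say which blending method any particular metasearch engine uses, so the sketch below is an assumption, and `engine_a` and `engine_b` are hypothetical stand-ins for calls to real search engines.

```python
from itertools import zip_longest


def blend(*ranked_lists):
    """Interleave ranked result lists, keeping the first occurrence of each URL."""
    seen, merged = set(), []
    for tier in zip_longest(*ranked_lists):   # rank 1 from each engine, then rank 2, ...
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def metasearch(query, engines):
    """Send the query to every engine and blend the ranked answers."""
    return blend(*(engine(query) for engine in engines))


# Hypothetical engines returning fixed ranked lists, for illustration only.
def engine_a(query):
    return ["a.example", "b.example", "c.example"]


def engine_b(query):
    return ["b.example", "d.example"]


results = metasearch("search engine survey", [engine_a, engine_b])
print(results)
```

A result returned by more than one engine appears once, at the best rank any engine gave it.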

29 MSE: vivisimo

30 MSE: Kartoo.com

31 Answer-based search engines  Answers.com: presents reference content in over four million entries, collected from multiple sources.
