The Players
- The Majors
- Dead Search Engines
- International Search Engines
- Metasearch Engines
Google
- Developed as BackRub by Stanford University students Larry Page and Sergey Brin
- Became a private company and changed its name to Google in 1998
- One of the largest databases: more than 8 billion pages (this count includes pages its robots have found, even if its indexing program hasn't fully indexed them)
- Indexes 3 billion pages every 28 days; 3 million every day
- Makes money through powering over 130 portals and corporate Web sites, and through AdWords
Google
Google Spidering
- Uses its own bots to spider the Web
- Generally ignores meta keywords and description tags
Google
Google Indexing
- Descriptions (snippets) are formed automatically by extracting the most relevant portions of pages
- Finds the first instance of the search term on a page, then includes the words that appear around this term
- Only indexes the first 100K or so of a page
- Some pages don't have a description: Google will include a "botted" page even if it has not been "indexed"
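The snippet behavior described on this slide (find the first instance of the search term, then keep the words around it) can be sketched in a few lines. This is only an illustration under assumptions: the function name `make_snippet`, the `window` parameter, and the whitespace tokenization are hypothetical and are not Google's actual implementation.

```python
import re
from typing import Optional


def make_snippet(page_text: str, term: str, window: int = 5) -> Optional[str]:
    """Return the words around the first occurrence of `term`, or None."""
    words = page_text.split()
    target = term.lower()
    for i, word in enumerate(words):
        # Strip leading/trailing punctuation before comparing, so "pages." matches "pages".
        if re.sub(r"^\W+|\W+$", "", word).lower() == target:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No occurrence of the term: the page would be listed without a description.
    return None


text = "Search engines use robots to spider the Web and build an index of pages."
print(make_snippet(text, "spider", window=2))
```

A page on which the term never appears yields `None`, which mirrors the slide's point that some listed pages have no description.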
Google
Indexes:
- Web: indexed Web pages and other file types
- Ads: paid advertisements appear on the right side of or above search results under a "Sponsored Links" heading
- Images: million+ images searched
- Groups: million+ Usenet messages searched
- News
- Directory: a ranked version of the Open Directory using Google's PageRank
- Froogle: shopping and product search
- Catalog Search: scanned, searchable retail catalogs
Google
Web index subsets:
- Government sites
- Military sites
- University sites
- Linux sites
- Apple/Macintosh sites
- Microsoft sites
Google
New!
“Google teams with the libraries of Harvard, Stanford, the University of Michigan, the University of Oxford, and The New York Public Library to digitally scan books from their collections so that users worldwide can search them in Google…Users searching with Google will see links in their search results page when there are books relevant to their query. Clicking on a title delivers a Google Print page where users can browse the full text of public domain works and brief excerpts and/or bibliographic data of copyrighted material. Library content will be displayed in keeping with copyright law.”
Yahoo! Search
- Originally just a subject directory
- Search engine launched Feb
- Indexes the first 500 KB of a Web page
- Includes some pay-for-inclusion sites
Teoma
- Founded in 2000 by a team of scientists from Rutgers University
- Teoma means "expert" in Gaelic
- Acquired by Ask Jeeves, Inc. in September 2001
Teoma
- More than 2 billion English-only Web documents
- Spam, duplicates, and pornographic results removed from index
- Indexes the whole page; no stop words
- Considers meta-tag descriptions
- Aims to re-index every month (freshness)
- Sponsored links from Google AdWords
Teoma
Establishing authority and relevancy:
- Refine: organizes sites into naturally occurring communities that are about the subject of each search query
- Results: analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it (Subject-Specific Popularity)
- Resources: identifies expert resources about a particular subject
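The "Results" step (rank a site by the number of same-subject pages that reference it) can be illustrated with a small link-counting sketch. This is a simplified reading of the slide, not Teoma's actual algorithm: the function name `subject_specific_popularity` and the inputs (a list of community members and a list of `(source, target)` link pairs) are assumptions made for the example.

```python
from collections import Counter


def subject_specific_popularity(community, links):
    """Rank community members by how many other members link to them.

    community: iterable of page identifiers judged to be about the query subject
    links: iterable of (source, target) pairs meaning "source links to target"
    """
    members = set(community)
    counts = Counter({page: 0 for page in members})
    # Use set(links) so a repeated link between the same two pages counts once.
    for source, target in set(links):
        # Only links that stay inside the community count toward popularity.
        if source in members and target in members and source != target:
            counts[target] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


community = ["a", "b", "c"]
links = [("a", "c"), ("b", "c"), ("c", "a"), ("x", "a"), ("x", "b")]
print(subject_specific_popularity(community, links))
```

Here page `x` is outside the community, so its links to `a` and `b` are ignored. That is the difference between this measure and a plain count of all inbound links.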
Gigablast
- Founded in 2000
- Built and operated by sole proprietor Matt Wells
- Created to index up to 200 billion pages with the least amount of hardware possible
- Currently indexes 650 million pages
- Provides "Gigabits" to help searchers refine their search based upon related topics from search results
- Makes money by selling search services to private companies
Wisenut
- Newer database
- ~ million pages indexed
- 1.5 billion identified, not crawled/indexed
- Few advanced search features
- Spider capable of fetching more than 100 million pages a day
- Often months out of date
- Smart/Relevant: considers all words on the page, the text of referring links and the words around them, and the significance and content of the pages containing the links
- Generates automatic semantic searches called WiseGuide categories
MSN Search
- New, improved
- ~4.2 billion pages searched/indexed?
- Formerly used Inktomi; now has proprietary robots, indexer, and retrieval engine
Dead Search Engines
Whatever happened to…?
- Direct Hit: defunct, redirecting to Teoma
- Infoseek: defunct, redirecting to Go
- Magellan: dead, redirects to WebCrawler
- Northern Light: defunct
- Openfind: under "reconstruction" as of 2003
- WebTop: dead
Dead Search Engines
The search engine formerly known as…
- AlltheWeb: uses Yahoo! database
- AltaVista: uses Yahoo! database
- Excite: uses an InfoSpace metasearch
- Go: took over Infoseek, but now just uses Overture
- iWon: now uses Google "sponsored" ads, Web, and image databases
- LookSmart: uses Wisenut search engine
- Lycos: uses Yahoo!/Inktomi database and LookSmart directory
- NBCi (formerly Snap): uses metasearch engine Dogpile
- WebCrawler: uses an InfoSpace metasearch
International Search Engines
There are hundreds of search engines all over the world. We will not be investigating any of these very closely, but you can use the resources below to locate and master international search engines:
- All Search Engines: foreign search engines
- Search Engines Worldwide
- Search Engine Colossus
- Country-specific Search Engines
Metasearch Engines
- A search engine that queries other search engines and then combines the results received from all of them
- The user is not using just one search engine but a combination of many search engines at once to optimize Web searching
Metasearch Engines
The differences among them:
- Engines covered (many pay-for-placement)
- Number of engines that can be searched at once
- Sophistication of search query
- Number of records from each search engine
- Length of time it will search each search engine
- Deleting duplicates (de-duping)
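The core metasearch steps named on these slides (take a limited number of records from each engine, combine them, and delete duplicates) can be sketched as follows. The engine names, URLs, and the functions `normalize` and `merge_results` are hypothetical, and the URL normalization is deliberately crude (it lowercases the host, drops a leading `www.` and a trailing slash, and ignores query strings). Real metasearch engines may de-dupe quite differently.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Build a crude comparison key so trivially different URLs match."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(results_by_engine, per_engine=10):
    """Combine per-engine result lists, keeping one entry per distinct page."""
    merged = {}
    for engine, urls in results_by_engine.items():
        # "Number of records from each search engine": cap what each engine contributes.
        for url in urls[:per_engine]:
            key = normalize(url)
            entry = merged.setdefault(key, {"url": url, "engines": []})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    return list(merged.values())


results = {
    "EngineA": ["http://www.example.com/", "http://example.org/page"],
    "EngineB": ["http://example.com", "http://example.net/"],
}
for entry in merge_results(results):
    print(entry["url"], entry["engines"])
```

Keeping the list of engines that returned each page is a design choice: it makes the de-duping visible and gives a later ranking step something to work with.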
Metasearch Engines
- Dogpile
- Metacrawler
- Mamma
- KartOO
- Clusty
- Surfwax
- Ixquick
- Fazzle
- InfoGrid
- Gimenei
Metasearch Engines
Good for getting a lay of the land:
- What is out there?
- Is there anything out there?
- Who covers a topic best?
- Learning the names of new or emerging search engines
Metasearch Engines
Otherwise, you are usually better off searching multiple SEs individually:
- Syntax varies among search engines, and metasearch engines may not allow you to make use of all search engines
- A metasearch engine may not translate your query well into different SEs
Metasearch Engines
Check out some cool, value-adding features emerging in metasearch engines
Clusty
Clusty (using the Vivísimo clustering engine):
- Clustering: uses an algorithm to put search results together based on textual and linguistic similarity
- Groups are further refined using heuristics (i.e., human knowledge) designed to show what users wish to see when they examine clustered documents
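To make "put search results together based on textual similarity" concrete, here is a toy sketch that groups result snippets whose word sets overlap. It is not the Vivísimo algorithm, which the slide does not describe in detail. The Jaccard similarity measure, the greedy single-pass grouping, the `threshold` value, and all function names are assumptions chosen only for illustration.

```python
def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(snippets, threshold=0.3):
    """Greedy grouping: each snippet joins the first cluster whose seed
    snippet is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of snippet indices; index 0 is the seed
    for i, snippet in enumerate(snippets):
        words = tokens(snippet)
        for group in clusters:
            if jaccard(words, tokens(snippets[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])
    return clusters


snippets = [
    "jaguar car dealer prices",
    "jaguar car prices and reviews",
    "jaguar cat habitat rainforest",
]
print(cluster(snippets))
```

With these inputs the two car-related snippets land in one group and the animal snippet in another, which is the kind of theme overview the slide describes.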
Clusty
“Vivísimo's Clustering Engine lets you see deeper and farther--with less effort--into a large number of search results to:
- Get a quick overview of the main themes that relate to the query.
- See similar results grouped together for faster access.
- Find results that are buried in the ranked list and would otherwise be missed.
- Discover unexpected results and relationships between items.”
Mamma
rSort
- Considers each listing duplicated in more than one SE as a "vote" for that page
- Uses votes to rank pages per the "Condorcet Method"
- One of the big advantages of this ranking method is the elimination of search engine spam
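A Condorcet-style ranking treats each search engine's result list as a ballot and compares pages pairwise. The sketch below ranks pages by the number of head-to-head contests they win (a Copeland-style count, which is one common Condorcet-consistent method). The slide does not say how rSort resolves ties or handles pages missing from an engine's list, so those details, along with the function name `condorcet_rank`, are assumptions.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Order pages by the number of pairwise majority contests they win.

    rankings: list of ranked result lists, one per search engine (best first)
    """
    pages = sorted({page for ranking in rankings for page in ranking})
    wins = {page: 0 for page in pages}

    def position(ranking, page):
        # Assumption: a page an engine did not return ranks below all it did return.
        return ranking.index(page) if page in ranking else len(ranking)

    for a, b in combinations(pages, 2):
        prefer_a = sum(position(r, a) < position(r, b) for r in rankings)
        prefer_b = sum(position(r, b) < position(r, a) for r in rankings)
        if prefer_a > prefer_b:
            wins[a] += 1
        elif prefer_b > prefer_a:
            wins[b] += 1
    # Most pairwise wins first; ties broken alphabetically for a stable order.
    return sorted(pages, key=lambda p: (-wins[p], p))


rankings = [["x", "y", "z"], ["y", "x", "z"], ["x", "z"]]
print(condorcet_rank(rankings))
```

A page promoted by only one engine loses most pairwise contests against pages that several engines list, which is consistent with the slide's claim that this kind of voting suppresses search engine spam.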
KartOO
- Interactive mapping display for results
- Uses a proprietary algorithm to sort pages
- Relevance of results is displayed as different-sized pages
- When you move the pointer over these pages, the relevant keywords are illuminated and a brief description of the site appears on the left side of the screen
- Click keywords to refine the search
- Refined or further results are also displayed on a map
Surfwax
- Targeted multi-source searching
- Searches only sources from specific domains or topics determined to be relevant
- SurfWax can spider deeper in any public site, including pages or parts that are invisible to traditional search engines
- Uses a site's existing search syntax to uncover "deeper" content
Ixquick
- Understands and translates, when possible, complex syntax
- Complete Boolean searching
- Truncation/wildcard searching
Fazzle
- Metasearches SEs, plus unique searches in news and other invisible Web resources
- Ranks everything together
- Delivers timely resources from news sources
- Delivers dynamic content missing from other metasearch engines