Published by Posy Green. Modified over 9 years ago.
1
Search Engines
2
Internet Protocol (IP)
Two major functions:
– Addressing: addresses identify hosts and their locations
– Delivery: the address identifies the destination of each packet
Connectionless protocol
Reliability is not guaranteed:
– Data corruption: faults introduced when messages are broken up into packets
– Lost data packets
– Duplicate arrival
IP addresses for hosts:
– A numerical label assigned to each device participating in a computer network that uses the Internet Protocol for communication
– Hostnames (Ex: cs.umb.edu)
– We prefer meaningful names, but behind the scenes hostnames are converted to IP addresses, which are a series of four decimal numbers separated by dots. Ex: 205.39.155.18 (stored in 32 bits). What is a potential problem with 32-bit addresses?
– Can be split into a network address (type/size of network) and a host number (machine/device number on this network)
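The 32-bit layout on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 24-bit network prefix is an assumption, since the slide does not say where the network/host boundary falls for its example address.

```python
import ipaddress

def dotted_to_int(address):
    """Pack four decimal numbers (0-255 each) into one 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # each number fills the next 8 bits
    return value

def split_address(address, prefix_len):
    """Return (network address, host number) for an assumed prefix length."""
    value = dotted_to_int(address)
    host_mask = (1 << (32 - prefix_len)) - 1
    network = value & ~host_mask & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(network)), value & host_mask

as_int = dotted_to_int("205.39.155.18")
network, host = split_address("205.39.155.18", 24)  # /24 is an assumption

# 32 bits allow at most 2**32 distinct addresses, which hints at the
# answer to the slide's question about 32-bit addresses.
total_addresses = 2 ** 32
```

With the assumed /24 prefix, the first three numbers form the network address and the last number is the host number.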
3
Domain name system
Example hostname: cs.bu.edu
Domain name: the part of a hostname that specifies a particular organization or group
Top-level domain (TLD): the last section of a domain name, specifying the type of organization or its country of origin
The domain name system (DNS) translates hostnames into numeric IP addresses: when you make a request, a domain name server translates the hostname into an IP address, and the request is then sent to that address
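The hostname-to-address translation can be illustrated with a toy lookup table. This is a sketch only: the table, the function names, and the addresses are hypothetical (the addresses come from the 192.0.2.0/24 range reserved for documentation), and a real resolver would ask DNS servers, for example through `socket.gethostbyname`.

```python
# Hypothetical records standing in for what a domain name server knows.
RECORDS = {
    "cs.umb.edu": "192.0.2.10",
    "cs.bu.edu": "192.0.2.20",
}

def top_level_domain(hostname):
    """Return the last section of the name, e.g. 'edu'."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()

def resolve(hostname, records=RECORDS):
    """Translate a hostname into an IP address using the toy table."""
    name = hostname.rstrip(".").lower()  # hostnames are case-insensitive
    if name not in records:
        raise KeyError(f"unknown host: {hostname}")
    return records[name]

tld = top_level_domain("cs.bu.edu")
address = resolve("CS.BU.EDU")
```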
4
World Wide Web
Good news:
– Millions of webpages available on a variety of topics
Bad news:
– Millions of webpages available on a variety of topics
– Haphazard labeling
– Sitting on servers in various locations
How do we search for a specific topic?
5
Search engines!
A web search engine is designed to search for information on the World Wide Web and FTP servers
– Searches the web based on key words (Crawl)
– Keeps an index of useful pages (Index)
– Presents users with information based on the index (Search)
Search engines operate algorithmically or with a mix of algorithmic and human input
6
Search engines: History
Early summer of 1993: no commercial or large-scale search engines existed
W3Catalog:
– World's first primitive search engine
– By Oscar Nierstrasz at the University of Geneva
Wandex:
– Web robot
– Matthew Gray at MIT
– Built to measure the size of the World Wide Web
7
Search engines: History
Aliweb:
– Indexed by hand
JumpStation:
– Web robot
– Crawl, index and search
2000: Google rose to fame!
– Algorithm called PageRank, which ranks a webpage based on the number and PageRank of the other pages that link to it
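The PageRank idea can be sketched as a short iteration over a link graph. This is a simplified textbook version, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Page c ranks highest because three pages link to it, and a outranks b and d because its single incoming link comes from the highly ranked c.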
8
Difference
Early search engines
– A few hundred thousand pages
– One or two thousand inquiries per day
Top search engines
– Hundreds of millions of pages
– Billions of queries per day
9
Web Crawling
A web crawler is also known as a spider
Special software agent that finds web pages and follows the links on them
Contents are analyzed
– Words, titles, special fields called meta tags
Starting point?
– Popular pages
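The crawl loop (start from popular pages, read each page, follow its links) can be sketched over a miniature in-memory web. Everything here is hypothetical: the URLs, the page texts, and the `fetch` callback, which stands in for a real HTTP download and HTML link extraction.

```python
from collections import deque

# Hypothetical miniature web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.com/": ("popular start page",
                            ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page about spiders", ["http://example.com/b"]),
    "http://example.com/b": ("page about indexes",
                            ["http://example.com/", "http://example.com/c"]),
    "http://example.com/c": ("leaf page", []),
}

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: visit seeds first, then pages they link to."""
    frontier = deque(seeds)
    seen = set(seeds)          # never queue the same URL twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        result = fetch(url)
        if result is None:     # page could not be retrieved
            continue
        text, links = result
        pages[url] = text      # keep the content for later analysis
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

crawled = crawl(["http://example.com/"], TOY_WEB.get)
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other.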
10
Google: Web Crawling
At its peak:
– Uses multiple spiders
– Each spider can keep ~300 connections to pages open at a time
– Generates about 600 KB of data per second
Starting points:
– A dedicated server feeds URLs to the spiders
– Instead of relying on an ISP for domain name lookups, Google runs its own DNS server
Google spider looks at two things:
– Significant words within the page
– Location of the words
Why is location important?
11
Meta tags
Specified by the page owner
– Can be helpful
– Problem?
– Robot exclusion protocol
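A well-behaved spider can honor the robot exclusion protocol by checking a site's robots.txt rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the spider name, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access

allowed = parser.can_fetch("ExampleSpider", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://example.com/private/notes.html")
```

The protocol is voluntary: it tells cooperative robots what the owner wants, but it does not technically prevent access.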
12
Indexing
Spiders get the data
– Now what?
– Content analysis
– Method by which information is sorted and stored
One way: storing the word and the associated URL
– No way to tell if the word is important or trivial
– How many times was the word used?
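Storing each word with its URLs, plus a count of how often the word was used, gives what is usually called an inverted index. A minimal sketch, assuming two made-up pages and a deliberately simple rule for splitting text into words:

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to {url: number of times the word appears there}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index

# Hypothetical crawled pages.
pages = {
    "http://example.com/a": "Spiders crawl the web. Spiders follow links.",
    "http://example.com/b": "An index maps each word to the web pages that contain it.",
}
index = build_index(pages)
```

The counts answer the slide's question about how many times a word was used, though they still say nothing about whether the word is important or trivial.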
13
Ranking
A relationship between items that determines their ordering
For more useful information:
– Number of times a word appears on the page
– Assign a weight to each word
Each search engine has a different formula for assigning weight to the words in its index
Popular way of indexing: hashing
– A formula computes a numerical value for each word, and that value is used to store and retrieve the word's entry
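Both ideas on this slide, weighting and hashing, fit in a few lines. The weights below (5 for a title word, 1 for a body word) and the toy hash formula are assumptions for illustration; as the slide notes, every search engine uses its own formula.

```python
TITLE_WEIGHT = 5  # assumed: a word in the title counts five times as much
BODY_WEIGHT = 1   # assumed: baseline weight for a word in the body

def word_weight(word, title, body):
    """Weight a word by how often and where it appears on a page."""
    word = word.lower()
    title_hits = title.lower().split().count(word)
    body_hits = body.lower().split().count(word)
    return TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits

def bucket_for(word, bucket_count=16):
    """Toy hash formula: sum of character codes modulo the bucket count."""
    return sum(ord(ch) for ch in word.lower()) % bucket_count

weight = word_weight("search", "Search Engines",
                     "a search engine helps you search the web")
bucket = bucket_for("web")
```

The same word always hashes to the same bucket, so an entry can be found by recomputing the formula instead of scanning the whole index.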
14
Building a search
Query: a single word or a string of words
Complex queries require the use of Boolean operators
– AND: all terms joined by the operator must appear, also written '+'
– OR
– NOT
– FOLLOWED BY
– NEAR
– Quotation marks
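AND, OR and NOT map directly onto set operations over an index, while NEAR needs word positions. A minimal sketch, assuming a tiny hand-made index and a made-up `near` helper whose default distance of three words is arbitrary:

```python
# Hypothetical index: word -> set of page ids containing it.
INDEX = {
    "search": {"p1", "p2", "p3"},
    "engine": {"p1", "p3"},
    "spider": {"p2", "p3"},
}

def pages_with(word):
    """Pages containing the word, or an empty set if it is not indexed."""
    return INDEX.get(word.lower(), set())

both = pages_with("search") & pages_with("engine")     # search AND engine
either = pages_with("engine") | pages_with("spider")   # engine OR spider
without = pages_with("search") - pages_with("spider")  # search NOT spider

def near(text, first, second, max_gap=3):
    """True if the two words occur within max_gap words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= max_gap for a in first_pos for b in second_pos)

close = near("web search engine history", "web", "engine", max_gap=2)
```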
15
Building a search
Literal searches: based on Boolean operators
Concept-based: statistical analysis of pages containing the words or phrases you search for
– More information is stored about each page
– Search times may be longer
Natural language queries
– Ask a question: AskJeeves.com
– Parses keywords
16
Money money money..
Beyond selling shares or private investment
Three main methods:
– Online purchases
– Web advertising: keywords relating to a product, service or business
– Allowing users to integrate ads into their own websites
Fourth, shady way: selling user information
17
Google Company Culture
Sergey Brin and Larry Page began Google with a few networked computers at Stanford
Multibillion dollar organization
– >19000 employees globally
– Market capitalization >$145 billion
Googleplex:
– Free food
– Gourmet café stations
– Snack rooms
– Exercise rooms
– Game rooms
– Grand piano