
1 Search Engines

2 What Are They?
- Four components
  - A database of references to webpages
  - An indexing robot that crawls the WWW
  - An interface
    - Enables users to submit queries
    - Displays results
  - An information retrieval system
- Each engine is unique, but most work in much the same way

3 Database
- Where the user's query is matched
- Contains only essential parts of pages
- Only includes pages that were indexed
- Search engines are always somewhat out of date: the database reflects pages as they were when last crawled

4 Web Crawler
- A robot that follows links
- Records data it finds
  - Words in the webpage
  - Metadata
  - ALT attributes in IMG tags
- Robot Exclusion Protocol
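The recording step and the Robot Exclusion Protocol check can be sketched with Python's standard library. This is only an illustration, not how any particular engine works: the class name `PageRecorder`, the helper `allowed`, the agent name `demo-bot`, and the sample page and robots.txt are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class PageRecorder(HTMLParser):
    """Collects links to follow, words, metadata, and ALT text from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.words, self.alts = [], [], []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])        # links the robot would follow
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])          # ALT attributes in IMG tags
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_data(self, data):
        self.words.extend(data.lower().split())     # words in the webpage


def allowed(robots_txt, agent, url):
    """Return True if robots.txt permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical sample data for the sketch.
SAMPLE_PAGE = """<html><head><title>Cats</title>
<meta name="keywords" content="cats, pets"></head>
<body><p>All about cats</p>
<img src="cat.png" alt="A sleeping cat">
<a href="/dogs.html">Dogs</a></body></html>"""

SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

recorder = PageRecorder()
recorder.feed(SAMPLE_PAGE)
```

A real crawler would also fetch pages over the network, resolve relative links, and queue them; those parts are left out here.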

5 Search Engine Interfaces
- Gathers input from users
- Presents results from the IR system
  - Often in ranked order

6 Search Engine Interfaces
- Input
  - User requirements
    - Search expression, search limits
  - Presentation style
    - Presentation format, search type

7 Search Engine Interfaces
- Output
  - Results
  - Descriptions
  - Clusters

8 Example: Visual Clustering Interface

9 Large Example: Clustering Visual Interface
- Grokker

10 Search Term Matching
- Trying to find a match in the database
- Two main methods
  - Keyword searching
    - Matching single terms, computing cosine similarity between query and document
  - Concept-based searching
    - Examining clusters of words
    - Attempts to determine the meaning of the query and find records related to that meaning
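Keyword matching with a cosine score can be sketched as follows. This is a minimal bag-of-words version using raw term counts; real engines typically weight terms as well, and the function names `cosine` and `rank_documents` and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(query, document):
    """Cosine similarity between two texts treated as bags of words."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank_documents(query, documents):
    """Return document names ordered from best to worst match."""
    return sorted(documents, key=lambda name: cosine(query, documents[name]),
                  reverse=True)


# Hypothetical mini-database of indexed text.
DOCS = {
    "crawler": "a web crawler follows links between pages",
    "recipes": "slow cooker recipes for busy people",
}
```

Here `rank_documents("web crawler", DOCS)` puts the page that shares terms with the query first; a page with no terms in common scores zero.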

11 Basic IR Features
- Boolean operators
  - AND, OR, NOT, grouping
- Extended operators
  - NEAR, ADJACENT, phrase quotes (")
- Stop word deletion
- Stemming
- Searching in fields (e.g. host)
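Boolean operators, stop word deletion, and stemming fit together roughly as in the sketch below. The stop word list, the deliberately crude suffix-stripping `stem` function (not a real stemming algorithm such as Porter's), and the three sample documents are all assumptions for illustration.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def stem(word):
    """Very crude stemmer: strips a trailing 'ing' or 's' from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def build_index(documents):
    """Map each stemmed, non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index.setdefault(stem(word), set()).add(doc_id)
    return index


def postings(index, term):
    """Documents containing the term, after the same stemming as the index."""
    return index.get(stem(term.lower()), set())


# Hypothetical documents.
DOCS = {
    1: "Crawlers follow links",
    2: "The crawler indexes pages",
    3: "Searching the index of pages",
}
INDEX = build_index(DOCS)

# Boolean operators map directly onto set operations.
both = postings(INDEX, "crawler") & postings(INDEX, "pages")      # AND
either = postings(INDEX, "links") | postings(INDEX, "searching")  # OR
without = postings(INDEX, "pages") - postings(INDEX, "crawler")   # NOT
```

Stemming is what lets the query term "crawler" match the document that says "Crawlers"; stop word deletion keeps words like "the" out of the index entirely.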

12 Ranked Output
- Most SEs produce ranked lists by applying simple rules:
  - Early words are more important
  - Title is very important
  - Frequency of occurrence matters for some
  - Infrequent words matter more
  - Modification date
- Google is different:
  - PageRank™ method based on popularity
  - Links as money
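The "links as money" idea can be sketched as a small power iteration in the style of the published PageRank description. This is a simplified illustration, not Google's production ranking: the damping value 0.85, the iteration count, the handling of pages without outgoing links, and the three-page graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its rank evenly among its outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page "spends" its rank equally across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
RANKS = pagerank(GRAPH)
```

In this graph C ends up ranked above B: C receives all of B's rank plus half of A's, while B receives only half of A's.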

13 Googlebombing
- Google spoofed from the lecture list
- First hit from 1992
- Official GoogleBlog explanation

14 What about the Invisible Web?
- Also known as the Deep Web
- Documents that are on the WWW but not indexed by search engines
  - Some are available only by submitting forms
  - Some are not generally accessible (in subnets)
  - Some are not in (X)HTML format

15 The Invisible Web Isn't So Invisible Anymore…
- More search engines parse non-(X)HTML now than before
- Because of awareness of the problem, companies are making more content available using
  - Stable URLs
  - Robot-friendly sitemaps
- But much content is still not indexed

16 But there are still plenty of important yet invisible docs
- How to find them?
- Many of them are in databases
- No one search engine covers everything
- Use database tools from the U.'s library
  - Especially for research articles
- Use multiple search engines or a meta-crawler
  - Dogpile is the most famous
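A meta-crawler has to combine ranked lists from several engines into one. The sketch below shows one possible merging rule, summing reciprocal ranks so that results listed high by several engines rise to the top. The rule, the function name `merge_results`, and the sample result lists are assumptions; they do not describe how Dogpile or any other service actually works.

```python
def merge_results(result_lists):
    """Merge ranked URL lists; a URL scores 1/position in each list it appears in."""
    scores = {}
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results returned by two engines for the same query.
ENGINE_A = ["a.example/1", "b.example/2", "c.example/3"]
ENGINE_B = ["b.example/2", "d.example/4"]
MERGED = merge_results([ENGINE_A, ENGINE_B])
```

Because `b.example/2` appears in both lists, it outranks `a.example/1` even though only one engine placed it first.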

17 Search Engines: A Summary of Practical Advice

18 How To Succeed With SEs
- As a surfer:
  - If you don't know what you are looking for
    - Use multiple SEs, or a meta-crawler
    - Search within results
  - If you don't know what you are looking for
    - Use multiple SEs, or a meta-crawler
    - Use Boolean expressions or search within results
    - Consider specialized engines

19 How To Succeed With SEs
- As a creator:
  - HTML level
    - Always use ALT attributes with IMG, etc.
    - Avoid frames
  - Make it easier to index
    - Don't expect SEs to find your pages
    - Make links between your pages
  - Use metadata
    - Informal:
    - Formal: Dublin Core and others
  - Increase your pages' popularity
    - Don't use systematic reciprocal linking: rings, exchanges, lists
    - The PageRank™ a page passes along each link is inversely proportional to its outdegree

20 How To Succeed With SEs
- As a creator (cont.)
  - For surfers:
    - Use
    - Don't expect surfers to start at the top of your hierarchy
    - Don't rely on a hierarchy
    - Include a context map near the top of each page
    - Don't use frames
    - Think through dynamic content implications
- Stickiness… is for another day

