Published by Lambert Phelps. Modified over 9 years ago.
1. Search Engines
2. What Are They?
Four components:
- A database of references to webpages
- An indexing robot that crawls the WWW
- An interface
  - Enables users to submit queries
  - Displays results
- An information retrieval (IR) system
Each engine is unique, but they are mostly the same.
3. Database
- Where the user's query is matched
- Contains only essential parts of pages
- Only includes pages that were indexed
- Search engines are always out of date: the database reflects pages as of the last crawl
4. Web Crawler
- A robot that follows links
- Records data it finds
  - Words in the webpage
  - Metadata
  - ALT attributes in IMG tags
- Robots Exclusion Protocol (robots.txt)
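The Robots Exclusion Protocol mentioned on the crawler slide can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration and are not from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption, not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
for url in ("https://example.com/index.html",
            "https://example.com/private/notes.html"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "allowed" if allowed else "disallowed")
```

A well-behaved indexing robot would run a check like this before following each link; pages it is told to skip never reach the database.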
5. Search Engine Interfaces
- Gathers input from users
- Presents results from the IR system
  - Often in ranked order
6. Search Engine Interfaces: Input
- User requirements
  - Search expression, search limits
- Presentation style
  - Presentation format, search type
7. Search Engine Interfaces: Output
- Results
- Descriptions
- Clusters
8. Example: Visual Clustering Interface
9. Large Example: Clustering Visual Interface (Grokker)
10. Search Term Matching
Trying to find a match in the database. Two main methods:
- Keyword searching
  - Matching single terms, computing cosine similarity between the query and each record
- Concept-based searching
  - Examining clusters of words
  - Attempts to determine the meaning of the query and find records related to that meaning
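The keyword-matching method above can be sketched as cosine similarity between term-count vectors. This is a minimal illustration, assuming whitespace tokenization and raw term counts; the function names and sample documents are hypothetical, and real engines typically weight terms rather than just counting them.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "search engine ranking",
    "web crawler follows links",
    "cooking pasta at home",
]
for score, doc in rank("web crawler", documents):
    print(round(score, 3), doc)
```

Documents that share no terms with the query score zero, which is the main weakness that concept-based searching tries to address.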
11. Basic IR Features
- Boolean operators
  - AND, OR, NOT, grouping
- Extended operators
  - NEAR, ADJACENT, quoted phrases (")
- Stop word deletion
- Stemming
- Searching in fields (e.g. host)
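Several of the features listed above (Boolean operators, stop word deletion, stemming) can be sketched over a small inverted index. This is an illustrative toy under stated assumptions: the stop word list is a tiny sample, `stem` is crude suffix stripping rather than a real stemming algorithm, and the documents are invented.

```python
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def lookup(index, word):
    return index.get(stem(word.lower()), set())

def boolean_and(index, a, b):
    return lookup(index, a) & lookup(index, b)

def boolean_or(index, a, b):
    return lookup(index, a) | lookup(index, b)

def boolean_not(index, all_ids, a):
    return all_ids - lookup(index, a)

docs = {
    1: "Crawlers follow links in the web",
    2: "Indexing the web pages",
    3: "Ranking pages by links",
}
index = build_index(docs)
print(boolean_and(index, "web", "links"))
print(boolean_or(index, "crawler", "ranking"))
print(boolean_not(index, set(docs), "links"))
```

Because both documents and queries pass through the same `stem` function, a query for "pages" and a query for "page" reach the same index entry.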
12. Ranked Output
Most SEs produce ranked lists by applying simple rules:
- Early words are more important
- Title is very important
- Frequency of occurrence matters for some
- Infrequent words matter more
- Modification date
Google is different:
- Google PageRank™ method based on popularity
- Links as money
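The "links as money" idea can be sketched with a simplified PageRank iteration: each page splits its current rank evenly among the pages it links to. This is a sketch of the published textbook form, not Google's production system; the damping value 0.85, the iteration count, and the three-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # "Links as money": split this page's rank among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: A and B both link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ranks highest because two pages "pay" into it, A comes next because it receives all of C's rank, and B, which nothing links to, keeps only the baseline share.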
13. Googlebombing
- Google spoofed, from the lecture list
- First hit, from 1992
- Official GoogleBlog explanation
14. What about the Invisible Web?
- Also known as the Deep Web
- Documents that are on the WWW but not indexed by search engines
  - Some are available only by submitting forms
  - Some are not generally accessible (in subnets)
  - Some are not in (X)HTML format
15. The Invisible Web Isn't So Invisible Anymore…
- More search engines parse non-(X)HTML now than before
- Because of awareness of the problem, companies are making more content available using
  - Stable URLs
  - Robot-friendly sitemaps
- But much content is still not indexed
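One common form of a robot-friendly sitemap is an XML file in the sitemaps.org format, listing stable URLs for crawlers. The slides do not say which sitemap format they mean, so treat this as an assumed illustration; the `build_sitemap` helper and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org XML format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build sitemap XML from (location, last-modified-or-None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages on an example site.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/articles/search-engines", None),
])
print(sitemap_xml)
```

Listing pages explicitly like this helps a crawler reach content that it could not discover by following links alone.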
16. But, there's still plenty of important yet invisible docs
How to find them?
- Many of them are in databases
- No one search engine covers everything
- Use database tools from the U.'s library
  - Especially for research articles
- Use multiple search engines or a meta-crawler
  - Dogpile is the most famous
17. Search Engines: A Summary of Practical Advice
18. How To Succeed With SEs
As a surfer:
- If you don't know what you are looking for
  - Use multiple SEs, or a meta-crawler
  - Use Boolean expressions or search within results
- Consider specialized engines
19. How To Succeed With SEs
As a creator:
- HTML level
  - Always use ALT attributes with IMG, etc.
  - Avoid frames
- Make it easier to index
  - Don't expect SEs to find your pages
  - Make links between your pages
- Use metadata
  - Informal:
  - Formal: Dublin Core and others
- Increase your pages' popularity
  - Don't use systematic reciprocal linking: rings, exchanges, lists
  - The PageRank™ a page passes to each page it links to is inversely proportional to its outdegree
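To see why ALT attributes, metadata, and links between pages matter to a creator, it helps to look at a page the way an indexing robot might. The sketch below uses Python's standard `html.parser`; the `PageSummary` class and the sample page are hypothetical, and real crawlers record considerably more than this.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the ALT text, named metadata, and links found on a page."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.meta = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

# Hypothetical page: one image has ALT text, the other does not.
SAMPLE_PAGE = """
<html><head>
<meta name="description" content="Notes on search engines">
</head><body>
<img src="crawler.png" alt="Diagram of a web crawler">
<img src="spacer.gif">
<a href="/ranking.html">Ranking</a>
</body></html>
"""

summary = PageSummary()
summary.feed(SAMPLE_PAGE)
print(summary.alt_texts)
print(summary.meta)
print(summary.links)
```

The image without an ALT attribute contributes nothing to the summary, so a text-based index has no way to know what it shows.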
20. How To Succeed With SEs
As a creator (cont.):
- For surfers:
  - Use
  - Don't expect surfers to start at top of your hierarchy
  - Don't rely on a hierarchy
  - Include a context map near the top of each page
  - Don't use frames
  - Think through dynamic content implications
- Stickiness… is for another day