Download presentation
Presentation is loading. Please wait.
1
How Search Engines Work: A Technology Overview
Avi Rappoport Search Tools Consulting UC Berkeley SIMS class 202 September 16, 2004
2
Purpose of Search Engines
Adaptive Path Purpose of Search Engines Helping people find what they’re looking for Starts with an “information need” Convert to a query Gets results In the materials available Web pages Other formats Deep Web UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Confidential · ©2001 Adaptive Path, LLC · Fillmore Street #404 · San Francisco, California 94115
3
Search is Not a Panacea Search can’t find what’s not there
The content is hugely important Information Architecture is vital Usable sites have good navigation and structure UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
4
Search Looks Simple UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
5
But It's Not Index ahead of time Provide search forms Display results
Find files or records Open each one and read it Store each word in a searchable index Provide search forms Match the query terms with words in the index Sort documents by relevance Display results UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
6
Search Processing UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
7
Search is Mostly Invisible
Like an iceberg, 2/3 below water user interface search functionality content UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
8
Text Search vs. Database Query
Text search works for structured content Keyword search vs. SQL queries Approximate vs. exact match Multiple sources of content Response time and database resources Relevance ranking, very important Works in the real world (e.g. EBay) UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
9
Search is Only as Good as the Content
Users blame the search engine Even when the content is unavailable Understand the scope of site or intranet Kinds of information Divided sites: products / corporate info Dates Languages Sources and data silos: CMSs, databases... Update processes UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
10
Making a Searchable Index
Store text to search it later Many ways to gather text Crawl (spider) via HTTP Read files on file servers Access databases (HTTP or API) Data silos via local APIs Applications, CMSs, via Web Services Security and Access Control UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
11
Robot Indexing Diagram
Source:James Ghaphery, VCU UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
12
What the Index Needs Basic information for document or record
File name / URL / record ID Title or equivalent Size, date, MIME type Full text of item More metadata Product name, picture ID Category, topic, or subject Other attributes, for relevance ranking and display UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
13
Simple Index Diagram UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
14
More Complex Index Processing
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
15
Index Issues Stopwords Stemming Metadata Semantics Explicit (tags)
Implicit (context) Semantics CMS and Database fields XML tags and attributes UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
16
Search Query Processing
What happens after you click the search button, and before retrieval starts. Usually in this order Handle character set, maybe language Look for operators and organize the query Look for field names or metadata Extract words (just like the indexer) Deal with letter casing UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
17
Search and Retrieval Retrieval: find files with query terms
Not the same as relevance ranking Recall: find all relevant items Precision: find only relevant items Increasing one decreases the other UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
18
Retrieval = Matching Single-word queries
Find items containing that word Multi-word queries: combine lists Any: every item with any query word All: only items with every word Phrases: find only items with all words in order Boolean and complex queries Use algorithm to combine lists UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
19
Why Searches Fail Empty search
Nothing on the site on that topic (scope) Misspelling or typing mistakes Vocabulary differences Restrictive search defaults Restrictive search choices Software failure UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
20
LII.org No-Matches Page
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
21
Relevance Ranking Theory: sort the matching items, so the most relevant ones appear first Can't really know what the user wants Relevance is hard to define and situational Short queries tend to be deeply ambiguous What do people mean when they type “bank”? First 10 results are the most important The more transparent, the better UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
22
Relevance Processing Sorting documents on various criteria
Start with words matching query terms Citation and link analysis Like old library Citation Indexes Ted Nelson - not only hypertext, but the links Google PageRank Incoming links Authority of linkers Taxonomies and external metadata UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
23
TF-IDF Ranking Algorithm
Term frequency in the item Inverse document frequency of term Rare words are likely to be more important wij = weight of Term Tj in Document Di tfij = frequency of Term Tj in Document Dj N = number of Documents in collection n = number of Documents where term Tj occurs at least once From Salton 1989 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
24
Other Algorithms Vector space Probabilistic (binary interdependence)
Fuzzy set theory Bayesian statistical analysis Latent semantic indexing Neural networks Machine learning All require sophisticated queries See MIR, chapter 2 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
25
Relevance Heuristics Heuristics are rules of thumb
Not algorithms, not math Search Relevance Ranking Heuristics Documents containing all search words Search words as a phrase Matches in title tag Matches in other metadata Based on real-word user behavior UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
26
Search Results Interface
What users see after they click the Search button The most visible part of search Elements of the results page Page layout and navigation Results header List of results items Results footer UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
27
Many Experiments in Interface
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
28
Back to Simplicity UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
29
Search Suggestions (aka Best Bets)
Human judgment beats algorithms Great for frequent, ambiguous searches Use search log to identify best candidates Recommend good starting pages Product information, FAQs, etc. Requires human resources That means money and time More static than algorithmic search UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
30
MSU Keywords UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
31
Siemens Results UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
32
Cooks.com Results UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
33
Salon.com Results UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
34
Faceted Metadata Search & Browse
Leverage content structure database fields (i.e. cruise amenities) document metadata (news article bylines) Provide both search and browse Support information foraging Integrate navigation with results Not just subject taxonomies Display only fruitful paths, no dead ends Supported by academic research Marti Hearst, UCB SIMS, flamenco.berkeley.edu UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
35
Faceted Search: Information
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
36
Faceted Search: Online Catalog
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
37
Search Metrics and Analytics
Number of searches Number of no-matches searches Traffic from search to high-value pages Relate search changes to other metrics Search Log Analysis Top 5% searches: phrases and words Top no-matches searches Use as market research UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
38
Search Will Never Be Perfect
Search engines can’t read minds User queries are short and ambiguous Some things will help Design a usable interface Show match words in context Keep index current and complete Adjust heuristic weighting Maintain suggestions and synonyms Consider faceted metadata search UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
39
Search Engines, sorta Rocket Science
Questions and discussion Contact me This presentation: UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.