Download presentation
Presentation is loading. Please wait.
Published byRylee Mealor Modified over 9 years ago
1
Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard
2
What Do People Search For? Searchers often don’t clearly understand –The problem they are trying to solve –What information is needed to solve the problem –How to ask for that information The query results from a clarification process Dervin’s “sense making”: Need GapBridge
3
Process/System Co-Design
4
Design Strategies Foster human-machine synergy –Exploit complementary strengths –Accommodate shared weaknesses Divide-and-conquer –Divide task into stages with well-defined interfaces –Continue dividing until problems are easily solved Co-design related components –Iterative process of joint optimization
5
Human-Machine Synergy Machines are good at: –Doing simple things accurately and quickly –Scaling to larger collections in sublinear time People are better at: –Accurately recognizing what they are looking for –Evaluating intangibles such as “quality” Both are pretty bad at: –Mapping consistently between words and concepts
6
Supporting the Search Process Source Selection Search Query Selection Ranked List Examination Document Delivery Document Query Formulation IR System Query Reformulation and Relevance Feedback Source Reselection NominateChoose Predict
7
Supporting the Search Process Source Selection Search Query Selection Ranked List Examination Document Delivery Document Query Formulation IR System Indexing Index Acquisition Collection
8
Taylor’s Model of Question Formation Q1 Visceral Need Q2 Conscious Need Q3 Formalized Need Q4 Compromised Need (Query) End-user Search Intermediated Search
9
Search Component Model Comparison Function Representation Function Query Formulation Human Judgment Representation Function Retrieval Status Value Utility Query Information NeedDocument Query RepresentationDocument Representation Query Processing Document Processing
10
Relevance Relevance relates a topic and a document –Duplicates are equally relevant, by definition –Constant over time and across users Pertinence relates a task and a document –Accounts for quality, complexity, language, … Utility relates a user and a document –Accounts for prior knowledge
11
“Bag of Terms” Representation Bag = a “set” that can contain duplicates “The quick brown fox jumped over the lazy dog’s back” {back, brown, dog, fox, jump, lazy, over, quick, the, the} Vector = values recorded in any consistent order {back, brown, dog, fox, jump, lazy, over, quick, the, the} [1 1 1 1 1 1 1 1 2]
12
Bag of Terms Example The quick brown fox jumped over the lazy dog’s back. Document 1 Document 2 Now is the time for all good men to come to the aid of their party. the quick brown fox over lazy dog back now is time for all good men to come jump aid of their party 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 Term Document 1Document 2 Stopword List
13
Advantages of Ranked Retrieval Closer to the way people think –Some documents are better than others Enriches browsing behavior –Decide how far down the list to go as you read it Allows more flexible queries –Long and short queries can produce useful results
14
Counting Terms Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “document frequency” makes it stronger still
15
Document Length Normalization Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects
16
“Okapi” Term Weights TF componentIDF component
17
Summary Goal: find documents most similar to the query Compute normalized document term weights –Some combination of TF, DF, and Length Optionally, get query term weights from the user –Estimate of term importance Compute inner product of query and doc vectors –Multiply corresponding elements and then add
18
Some Questions for Indexing How long will it take to find a document? –Is there any work we can do in advance? If so, how long will that take? How big a computer will I need? –How much disk space? How much RAM? What if more documents arrive? –How much of the advance work must be repeated? –Will searching become slower? –How much more disk space will be needed?
19
The Indexing Process quick brown fox over lazy dog back now time all good men come jump aid their party 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 Term Doc 1Doc 2 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 Doc 3 Doc 4 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 Doc 5Doc 6 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 Doc 7Doc 8 A B C F D G J L M N O P Q T AI AL BA BR TH TI 4, 8 2, 4, 6 1, 3, 7 1, 3, 5, 7 2, 4, 6, 8 3, 5 3, 5, 7 2, 4, 6, 8 3 1, 3, 5, 7 2, 4, 8 2, 6, 8 1, 3, 5, 7, 8 6, 8 1, 3 1, 5, 7 2, 4, 6 Postings Inverted File
20
The Finished Product quick brown fox over lazy dog back now time all good men come jump aid their party Term A B C F D G J L M N O P Q T AI AL BA BR TH TI 4, 8 2, 4, 6 1, 3, 7 1, 3, 5, 7 2, 4, 6, 8 3, 5 3, 5, 7 2, 4, 6, 8 3 1, 3, 5, 7 2, 4, 8 2, 6, 8 1, 3, 5, 7, 8 6, 8 1, 3 1, 5, 7 2, 4, 6 PostingsInverted File
21
Building an Inverted Index Simplest solution is a single sorted array –Fast lookup using binary search –But sorting large files on disk is very slow –And adding one document means starting over Tree structures allow easy insertion –But the worst case lookup time is linear Balanced trees provide the best of both –Fast lookup and easy insertion –But they require 45% more disk space
22
(Uncompressed) Index Size Very compact for Boolean retrieval –About 10% of the size of the documents If an aggressive stopword list is used Not much larger for ranked retrieval –Perhaps 20% Enormous for proximity operators –Sometimes larger than the documents!
23
Index Compression CPU’s are much faster than disks –A disk can transfer 1,000 bytes in ~20 ms –The CPU can do ~10 million instructions in that time Compressing the postings file is a big win –Trades decompression time for fewer disk reads Key idea: reduce redundancy –Trick 1: store relative offsets (some will be the same) –Trick 2: use an optimal coding scheme
24
Problems with “Free Text” Search Homonymy –Terms may have many unrelated meanings –Polysemy (related meanings) is less of a problem Synonymy –Many ways of saying (nearly) the same thing Anaphora –Alternate ways of referring to the same thing
25
Two Ways of Searching Write the document using terms to convey meaning Author Content-Based Query-Document Matching Document Terms Query Terms Construct query from terms that may appear in documents Free-Text Searcher Retrieval Status Value Construct query from available concept descriptors Controlled Vocabulary Searcher Choose appropriate concept descriptors Indexer Metadata-Based Query-Document Matching Query Descriptors Document Descriptors
26
Supervised Learning f 1 f 2 f 3 f 4 … f N v 1 v 2 v 3 v 4 … v N CvCv w 1 w 2 w 3 w 4 … w N CwCw Learner Classifier New example x 1 x 2 x 3 x 4 … x N CxCx Labelled training examples CwCw
27
Example: kNN Classifier
28
Problems with Controlled Vocabulary New concepts Users and indexers may think differently Using thesauri effectively requires training
29
Index Spam Goal: Manipulate rankings of an IR system Multiple strategies: –Create bogus user-assigned metadata –Add invisible text (font in background color, …) –Alter your text to include desired query terms –“Link exchanges” create links to your page
30
Some Observable Behaviors
31
Behavior Category
32
Minimum Scope
33
Some Examples Read/Ignored, Saved/Deleted, Replied to (Stevens, 1993) Reading time (Morita & Shinoda, 1994; Konstan et al., 1997) Hypertext Link (Brin & Page, 1998)
34
Estimating Authority from Links Authority Hub
35
Collecting Click Streams Browsing histories are easily captured –Make all links initially point to a central site Encode the desired URL as a parameter –Build a time-annotated transition graph for each user Cookies identify users (when they use the same machine) –Redirect the browser to the desired page Reading time is correlated with interest –Can be used to build individual profiles –Used to target advertising by doubleclick.com
36
0 20 40 60 80 100 120 140 160 180 No Interest Low Interest Moderate Interest High Interest Rating Reading Time (seconds) Full Text Articles (Telecommunications) 50 32 58 43
37
Problems with Observed Behavior Protecting privacy –What absolute assurances can we provide? –How can we make remaining risks understood? Scalable rating servers –Is a fully distributed architecture practical? Non-cooperative users –How can the effect of spamming be limited?
38
Putting It All Together Free TextBehaviorMetadata Topicality Quality Reliability Cost Flexibility
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.