Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard

What Do People Search For? Searchers often don’t clearly understand –The problem they are trying to solve –What information is needed to solve the problem –How to ask for that information The query results from a clarification process Dervin’s “sense making”: Need → Gap → Bridge

Process/System Co-Design

Design Strategies Foster human-machine synergy –Exploit complementary strengths –Accommodate shared weaknesses Divide-and-conquer –Divide task into stages with well-defined interfaces –Continue dividing until problems are easily solved Co-design related components –Iterative process of joint optimization

Human-Machine Synergy Machines are good at: –Doing simple things accurately and quickly –Scaling to larger collections in sublinear time People are better at: –Accurately recognizing what they are looking for –Evaluating intangibles such as “quality” Both are pretty bad at: –Mapping consistently between words and concepts

Supporting the Search Process [Figure: the search process as a loop through the IR system: Source Selection → Query Formulation → Query → Search → Ranked List → Examination → Document → Document Delivery, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection; stages are annotated Nominate, Choose, and Predict]

Supporting the Search Process [Figure: the same process, now showing the system side: a Collection is Acquired and Indexed to build the index that Search runs against]

Taylor’s Model of Question Formation
Q1: Visceral need
Q2: Conscious need
Q3: Formalized need
Q4: Compromised need (the query)
[Figure brackets distinguish end-user search from intermediated search]

Search Component Model [Figure: an information need passes through query formulation and a representation function to become a query representation; a document passes through document processing and a representation function to become a document representation; a comparison function matches the two to produce a retrieval status value, whose utility is assessed by human judgment]

Relevance Relevance relates a topic and a document –Duplicates are equally relevant, by definition –Constant over time and across users Pertinence relates a task and a document –Accounts for quality, complexity, language, … Utility relates a user and a document –Accounts for prior knowledge

“Bag of Terms” Representation Bag = a “set” that can contain duplicates: “The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the} Vector = the same values recorded in any consistent order: {back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]

Bag of Terms Example
Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”
Terms in Document 1: quick, brown, fox, over, lazy, dog, back, jump
Terms in Document 2: now, time, all, good, men, come, aid, their, party
Stopword list (removed from both): the, is, for, to, of
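
A minimal sketch in Python (mine, not from the slides) of the bag-of-terms idea: tokenize, drop stopwords, and count. The stopword list is illustrative, and unlike the slides' example no stemming is done, so "jumped" stays "jumped".

import re
from collections import Counter

STOPWORDS = {"the", "is", "for", "to", "of"}   # illustrative; real lists are longer

def bag_of_terms(text):
    """Lowercase, pull out alphabetic tokens, drop stopwords and possessive 's."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and t != "s")

d1 = bag_of_terms("The quick brown fox jumped over the lazy dog's back.")
d2 = bag_of_terms("Now is the time for all good men to come to the aid of their party.")

# A vector is just the bag's counts read out in one consistent term order.
vocabulary = sorted(set(d1) | set(d2))
print([d1[t] for t in vocabulary])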

Advantages of Ranked Retrieval Closer to the way people think –Some documents are better than others Enriches browsing behavior –Decide how far down the list to go as you read it Allows more flexible queries –Long and short queries can produce useful results

Counting Terms Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “document frequency” makes it stronger still

Document Length Normalization Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects
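
A minimal sketch (mine) of one simple combination of TF, DF, and length; many weighting variants exist, and the function name is my own:

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} vector
    per document, scaled to unit length so long documents get no advantage."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))      # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                  # term frequency
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}   # TF x IDF
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors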

“Okapi” Term Weights [Figure: the Okapi term weight, annotated with its TF component and IDF component]
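
The weight itself survives here only as the figure's labels; a standard form of the Okapi BM25 term weight, with the two components the slide names, is (in LaTeX):

w_{t,d} = \underbrace{\frac{tf_{t,d}}{k_1\left((1-b) + b\,\frac{len(d)}{avdl}\right) + tf_{t,d}}}_{\text{TF component}} \times \underbrace{\log\frac{N - df_t + 0.5}{df_t + 0.5}}_{\text{IDF component}}

where tf_{t,d} is the frequency of term t in document d, df_t the number of documents containing t, N the collection size, len(d) the document length, avdl the average document length, and k_1 and b are tuning constants; the slide may have shown a variant.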

Summary Goal: find documents most similar to the query Compute normalized document term weights –Some combination of TF, DF, and Length Optionally, get query term weights from the user –Estimate of term importance Compute inner product of query and doc vectors –Multiply corresponding elements and then add
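
A minimal sketch (mine) of the last step, using sparse vectors like those from the earlier weighting sketch:

def score(query, doc):
    """Inner product of two sparse {term: weight} vectors:
    multiply corresponding elements, then add."""
    return sum(w * doc.get(t, 0.0) for t, w in query.items())

# Rank documents by decreasing similarity to the query, e.g.:
# ranked = sorted(vectors, key=lambda d: score(query_vector, d), reverse=True)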

Some Questions for Indexing How long will it take to find a document? –Is there any work we can do in advance? If so, how long will that take? How big a computer will I need? –How much disk space? How much RAM? What if more documents arrive? –How much of the advance work must be repeated? –Will searching become slower? –How much more disk space will be needed?

The Indexing Process [Figure: a term–document incidence matrix for Docs 1–8 (terms: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party) being transformed into postings lists and an inverted file, in which each term points to the list of documents that contain it]

The Finished Product [Figure: the completed inverted file: an alphabetically ordered term dictionary, each entry pointing to its postings list of document numbers]

Building an Inverted Index Simplest solution is a single sorted array –Fast lookup using binary search –But sorting large files on disk is very slow –And adding one document means starting over Tree structures allow easy insertion –But the worst case lookup time is linear Balanced trees provide the best of both –Fast lookup and easy insertion –But they require 45% more disk space
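
For illustration only, a minimal in-memory build (mine); it sidesteps the on-disk sorted-array vs. tree trade-offs the slide describes:

from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_number: list of tokens}. Returns {term: sorted postings list}."""
    postings = defaultdict(set)
    for doc_number, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_number)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index({1: ["quick", "brown", "fox"],
                              3: ["lazy", "brown", "dog"]})
print(index["brown"])   # [1, 3]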

(Uncompressed) Index Size Very compact for Boolean retrieval –About 10% of the size of the documents If an aggressive stopword list is used Not much larger for ranked retrieval –Perhaps 20% Enormous for proximity operators –Sometimes larger than the documents!

Index Compression CPUs are much faster than disks –A disk can transfer 1,000 bytes in ~20 ms –The CPU can do ~10 million instructions in that time Compressing the postings file is a big win –Trades decompression time for fewer disk reads Key idea: reduce redundancy –Trick 1: store relative offsets (some will be the same) –Trick 2: use an optimal coding scheme
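
A minimal sketch (mine) of both tricks: gaps instead of absolute document numbers, then variable-byte coding (a simple scheme; truly optimal codes such as gamma codes compress further):

def gaps(postings):
    """Trick 1: store each document number as an offset from the previous one."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vbyte_encode(numbers):
    """Trick 2: 7 data bits per byte; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunk = bytearray()
        while True:
            chunk.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunk[-1] |= 128
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = n * 128 + (byte % 128)
        if byte >= 128:            # last byte of this number
            numbers.append(n)
            n = 0
    return numbers

print(vbyte_decode(vbyte_encode(gaps([4, 8, 12, 100]))))   # [4, 4, 4, 88]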

Problems with “Free Text” Search Homonymy –Terms may have many unrelated meanings –Polysemy (related meanings) is less of a problem Synonymy –Many ways of saying (nearly) the same thing Anaphora –Alternate ways of referring to the same thing

Two Ways of Searching [Figure: two parallel routes from author to searcher. Content-based matching: the author writes the document using terms to convey meaning; the free-text searcher constructs a query from terms that may appear in documents; query terms are matched against document terms to yield a retrieval status value. Metadata-based matching: an indexer chooses appropriate concept descriptors; the controlled-vocabulary searcher constructs a query from the available concept descriptors; query descriptors are matched against document descriptors]

Supervised Learning [Figure: labelled training examples, each a feature vector f1 … fN with a class label, are fed to a learner that produces a classifier; the classifier then predicts the class of a new, unlabelled example]

Example: kNN Classifier
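
The kNN figure is not preserved; a minimal sketch (mine, using Euclidean distance and majority vote, the textbook formulation):

import math
from collections import Counter

def knn_classify(x, training, k=3):
    """training: list of (feature_vector, label) pairs.
    Label x by majority vote among its k nearest training examples."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(training, key=lambda example: distance(example[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"), ([0.1, 1.0], "politics")]
print(knn_classify([0.8, 0.2], train))   # sports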

Problems with Controlled Vocabulary –New concepts –Users and indexers may think differently –Using thesauri effectively requires training

Index Spam Goal: Manipulate rankings of an IR system Multiple strategies: –Create bogus user-assigned metadata –Add invisible text (font in background color, …) –Alter your text to include desired query terms –“Link exchanges” create links to your page

Some Observable Behaviors [Figure: a taxonomy of observable behaviors, organized along two axes: behavior category (e.g., examine, retain, reference, annotate) and the minimum scope observed (e.g., segment, object, class)]

Some Examples –Read/ignored, saved/deleted, replied to (Stevens, 1993) –Reading time (Morita & Shinoda, 1994; Konstan et al., 1997) –Hypertext links (Brin & Page, 1998)

Estimating Authority from Links [Figure: a link graph in which hub pages point to authority pages: good hubs point to good authorities, and good authorities are pointed to by good hubs]
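
The hub/authority picture suggests Kleinberg's HITS algorithm; a minimal power-iteration sketch (mine):

def hits(links, iterations=50):
    """links: {page: list of pages it links to}. Returns (authority, hub) scores.
    Good authorities are pointed to by good hubs; good hubs point to good
    authorities. Alternate the two updates, renormalizing each time."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        scale = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / scale for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        scale = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / scale for p, v in hub.items()}
    return auth, hub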

Collecting Click Streams Browsing histories are easily captured –Make all links initially point to a central site, encoding the desired URL as a parameter –Build a time-annotated transition graph for each user (cookies identify users when they use the same machine) –Redirect the browser to the desired page Reading time is correlated with interest –Can be used to build individual profiles –Used to target advertising by doubleclick.com
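
A minimal sketch (mine, Python standard library only) of the redirection trick: every link points at this logger with the real destination encoded as a query parameter.

import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

click_stream = []   # (user, url, timestamp); gaps between clicks give reading time

class ClickLogger(BaseHTTPRequestHandler):
    def do_GET(self):
        url = parse_qs(urlparse(self.path).query).get("url", [""])[0]
        user = self.headers.get("Cookie", "anonymous")   # cookie identifies user
        click_stream.append((user, url, time.time()))
        self.send_response(302)                          # send browser onward
        self.send_header("Location", url)
        self.end_headers()

# HTTPServer(("", 8080), ClickLogger).serve_forever()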

[Figure: rating (no interest, low interest, moderate interest, high interest) plotted against reading time in seconds, for full-text articles in a telecommunications collection]

Problems with Observed Behavior Protecting privacy –What absolute assurances can we provide? –How can we make remaining risks understood? Scalable rating servers –Is a fully distributed architecture practical? Non-cooperative users –How can the effect of spamming be limited?

Putting It All Together [Table: the three kinds of evidence (free text, behavior, metadata) compared on topicality, quality, reliability, cost, and flexibility]