2003.10.28 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003


1 2003.10.28 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003 http://www.sims.berkeley.edu/academics/courses/is202/f03/ SIMS 202: Information Organization and Retrieval Lecture 17: Boolean IR and Text Processing

2 2003.10.28 - SLIDE 2IS 202 – FALL 2003 Announcements Wishter volunteers meeting tonight 7:00 Testers needed!! – UI Tests on Image Gallery/ Annotation software Thursday between 2-4 and Friday 10-4. –The tests will be approximately 1 ½ hours (but most likely will run a bit shorter.) –Signup sheet will be available at the end of class

3 2003.10.28 - SLIDE 3IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic and Boolean IR Systems Text Processing Discussion Credit for some of the slides in this lecture goes to Marti Hearst


5 2003.10.28 - SLIDE 5IS 202 – FALL 2003 IR is an Iterative Process [Figure: diagram relating Repositories, Workspace, and Goals]

6 2003.10.28 - SLIDE 6IS 202 – FALL 2003 Berry-Picking Model Q0 Q1 Q2 Q3 Q4 Q5 A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

7 2003.10.28 - SLIDE 7IS 202 – FALL 2003 Restricted Form of the IR Problem The system has available only pre-existing, “canned” text passages Its response is limited to selecting from these passages and presenting them to the user It must select, say, 10 or 20 passages out of millions or billions!

8 2003.10.28 - SLIDE 8IS 202 – FALL 2003 Information Retrieval Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries This set of assumptions underlies the field of Information Retrieval

9 2003.10.28 - SLIDE 9IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic and Boolean IR Systems Text Processing Discussion Credit for some of the slides in this lecture goes to Marti Hearst

10 2003.10.28 - SLIDE 10IS 202 – FALL 2003 Structure of an IR System [Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19] Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations Both lines meet at Comparison/Matching, which yields Potentially Relevant Documents Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)

11 2003.10.28 - SLIDE 11IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic and Boolean IR Systems Text Processing Discussion Credit for some of the slides in this lecture goes to Marti Hearst

12 2003.10.28 - SLIDE 12IS 202 – FALL 2003 Central Concepts in IR Documents Queries Collections Evaluation Relevance

13 2003.10.28 - SLIDE 13IS 202 – FALL 2003 Documents What do we mean by a document? –Full document? –Document surrogates? –Pages? Buckland (JASIS, Sept. 1997) “What is a Document” Are IR systems better called Document Retrieval systems? A document is a representation of some aggregation of information, treated as a unit

14 2003.10.28 - SLIDE 14IS 202 – FALL 2003 Collection A collection is some physical or logical aggregation of documents –A database –A Library –An index? –Others?

15 2003.10.28 - SLIDE 15IS 202 – FALL 2003 Queries A query is some expression of a user’s information needs Can take many forms –Natural language description of need –Formal query in a query language Queries may not be accurate expressions of the information need –Differences between conversation with a person and formal query expression

16 2003.10.28 - SLIDE 16IS 202 – FALL 2003 Evaluation: Why Evaluate? Determine if the system is desirable Make comparative assessments Others?

17 2003.10.28 - SLIDE 17IS 202 – FALL 2003 What To Evaluate? How much of the information need was satisfied How much was learned about a topic Incidental learning –How much was learned about the collection –How much was learned about other topics How inviting the system is…

18 2003.10.28 - SLIDE 18IS 202 – FALL 2003 What To Evaluate? What can be measured that reflects users’ ability to use system? (Cleverdon 66) –Coverage of information –Form of presentation –Effort required/ease of use –Time and space efficiency –Recall Proportion of relevant material actually retrieved –Precision Proportion of retrieved material actually relevant Effectiveness
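The recall and precision definitions above can be made concrete with a small calculation. This is a minimal sketch, assuming documents are identified by ids and that the set of relevant documents is already known; the function name and the example ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved documents that are also relevant
    # Precision: proportion of retrieved material that is actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: proportion of relevant material that was actually retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant, 3 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d2", "d3", "d4", "d7", "d9"})
print(p, r)
```

In practice the hard part is obtaining the relevant set at all, which is why the relevance judgments discussed in the following slides matter.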

19 2003.10.28 - SLIDE 19IS 202 – FALL 2003 Relevance (revisited) “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.” (Saracevic, 1975, p. 324)

20 2003.10.28 - SLIDE 20IS 202 – FALL 2003 Relevance How relevant is the document –For this user, for this information need Subjective, but Measurable to some extent –How often do people agree a document is relevant to a query? How well does it answer the question? –Complete answer? Partial? –Background information? –Hints for further exploration?

21 2003.10.28 - SLIDE 21IS 202 – FALL 2003 Relevance Research and Thought Review to 1975 by Saracevic Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990 Special Issue of JASIS on relevance (April 1994, 45(3))

22 2003.10.28 - SLIDE 22IS 202 – FALL 2003 Saracevic Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process –Systems view –Destinations view –Subject Literature view –Subject Knowledge view –Pertinence –Pragmatic view

23 2003.10.28 - SLIDE 23IS 202 – FALL 2003 Define Your Own Relevance As we saw last time most definitions of relevance follow a “formula”: –Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor From Saracevic, 1975 and Schamber 1990

24 2003.10.28 - SLIDE 24IS 202 – FALL 2003 Schamber, Eisenberg and Nilan “Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems.” Systems-oriented relevance: Topicality

25 2003.10.28 - SLIDE 25IS 202 – FALL 2003 Schamber, et al. Conclusions “Relevance is a multidimensional concept whose meaning is largely dependent on users’ perceptions of information and their own information need situations Relevance is a dynamic concept that depends on users’ judgments of the quality of the relationship between information and information need at a certain point in time. Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user’s perspective.”

26 2003.10.28 - SLIDE 26IS 202 – FALL 2003 Janes’ View Topicality Pertinence Relevance Utility Satisfaction

27 2003.10.28 - SLIDE 27IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic and Boolean IR Systems Text Processing Discussion Credit for some of the slides in this lecture goes to Marti Hearst

28 2003.10.28 - SLIDE 28IS 202 – FALL 2003 Query Languages A way to express the question (information need) Types: –Boolean –Natural Language –Stylized Natural Language –Form-Based (GUI)

29 2003.10.28 - SLIDE 29IS 202 – FALL 2003 Simple Query Language: Boolean –Terms + Connectors (or operators) –Terms Words Normalized (stemmed) words Phrases Thesaurus terms –Connectors AND OR NOT

30 2003.10.28 - SLIDE 30IS 202 – FALL 2003 Boolean Queries Cat Cat OR Dog Cat AND Dog (Cat AND Dog) (Cat AND Dog) OR Collar (Cat AND Dog) OR (Collar AND Leash) (Cat OR Dog) AND (Collar OR Leash)
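One common way to evaluate queries like these is to treat each term as the set of documents that contain it, so that AND, OR, and NOT become set intersection, union, and difference. The sketch below assumes a tiny hypothetical collection of six documents; the `postings` table and the `docs` helper are illustrative names, not part of any particular system.

```python
# Hypothetical toy collection: term -> ids of the documents containing it.
postings = {
    "cat": {1, 2, 5},
    "dog": {2, 3},
    "collar": {2, 4, 5},
    "leash": {3, 6},
}
all_docs = {1, 2, 3, 4, 5, 6}


def docs(term):
    """Documents containing the term (empty set for unknown terms)."""
    return postings.get(term, set())


cat_or_dog = docs("cat") | docs("dog")    # Cat OR Dog
cat_and_dog = docs("cat") & docs("dog")   # Cat AND Dog
not_dog = all_docs - docs("dog")          # NOT Dog

# (Cat AND Dog) OR (Collar AND Leash)
q1 = (docs("cat") & docs("dog")) | (docs("collar") & docs("leash"))
# (Cat OR Dog) AND (Collar OR Leash)
q2 = (docs("cat") | docs("dog")) & (docs("collar") | docs("leash"))

print(sorted(q1), sorted(q2))
```

Note that NOT needs the full document set to subtract from, which is one reason it is usually written as part of a larger AND.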

31 2003.10.28 - SLIDE 31IS 202 – FALL 2003 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –Each of the following combinations works:

32 2003.10.28 - SLIDE 32IS 202 – FALL 2003 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –None of the following combinations works:
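One way to check which term combinations satisfy (Cat OR Dog) AND (Collar OR Leash) is to enumerate all sixteen possibilities. This is a sketch only: it treats each term as simply present or absent in a document, and the names `query`, `works`, and `fails` are hypothetical.

```python
from itertools import product


def query(cat, dog, collar, leash):
    """(Cat OR Dog) AND (Collar OR Leash) over presence flags."""
    return (cat or dog) and (collar or leash)


terms = ("Cat", "Dog", "Collar", "Leash")
works, fails = [], []
for values in product([True, False], repeat=4):
    # Names of the terms that are present in this combination.
    present = tuple(t for t, v in zip(terms, values) if v)
    (works if query(*values) else fails).append(present)

print(len(works), "combinations work;", len(fails), "do not")
for combo in works:
    print(" works:", combo)
```

A document matches only if it has at least one term from each parenthesized group, so a document containing both Cat and Dog but neither Collar nor Leash does not match.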

33 2003.10.28 - SLIDE 33IS 202 – FALL 2003 Boolean Logic [Figure: Venn diagram of two sets, A and B]

34 2003.10.28 - SLIDE 34IS 202 – FALL 2003 Boolean Queries Usually expressed as INFIX operators in IR –((a AND b) OR (c AND b)) NOT is UNARY PREFIX operator –((a AND b) OR (c AND (NOT b))) AND and OR can be n-ary operators –(a AND b AND c AND d) Some rules - (De Morgan revisited) –NOT(a) AND NOT(b) = NOT(a OR b) –NOT(a) OR NOT(b)= NOT(a AND b) –NOT(NOT(a)) = a
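The rules above can be checked exhaustively, since two Boolean variables have only four combinations of values. The following sketch does that and also shows n-ary AND and OR using Python's built-in `all` and `any`; the function name `de_morgan_holds` is hypothetical.

```python
from itertools import product


def de_morgan_holds():
    """Check the three rules for every combination of truth values."""
    for a, b in product([True, False], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True


print(de_morgan_holds())

# n-ary operators: (a AND b AND c AND d) and (a OR b OR c OR d)
terms = [True, True, False, True]
print(all(terms), any(terms))
```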

35 2003.10.28 - SLIDE 35IS 202 – FALL 2003 Boolean Logic [Figure: Venn diagram of three terms t1, t2, t3 over documents D1 to D11, with the eight regions labeled as minterms m1 to m8, each a conjunction of t1, t2, t3 or their complements]

36 2003.10.28 - SLIDE 36IS 202 – FALL 2003 Boolean Searching “Measurement of the width of cracks in prestressed concrete beams” Formal Query: Cracks AND Beams AND Width_measurement AND Prestressed_concrete Cracks Beams Width measurement Prestressed concrete Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
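The relaxed query above is the OR of every three-term AND, which amounts to requiring at least three of the four concepts. A minimal sketch of both forms, assuming each document is represented as a set of index terms (the term spellings and function names here are hypothetical):

```python
from itertools import combinations

TERMS = ["cracks", "beams", "width_measurement", "prestressed_concrete"]


def formal_query(doc_terms):
    """Cracks AND Beams AND Width_measurement AND Prestressed_concrete."""
    return all(t in doc_terms for t in TERMS)


def relaxed_query(doc_terms, k=3):
    """OR over every k-term AND, i.e. at least k of the four terms."""
    return any(all(t in combo_doc for t in combo)
               for combo in combinations(TERMS, k)
               for combo_doc in [doc_terms])


doc = {"cracks", "beams", "prestressed_concrete"}
print(formal_query(doc), relaxed_query(doc))
```

Lowering `k` relaxes the query further, at the cost of retrieving documents that cover fewer of the concepts.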

37 2003.10.28 - SLIDE 37IS 202 – FALL 2003 Pseudo-Boolean Queries A new notation, from web search –+cat dog +collar leash Does not mean the same thing! Need a way to group combinations Phrases: –“stray cat” AND “frayed collar” –+“stray cat” + “frayed collar”

38 2003.10.28 - SLIDE 38IS 202 – FALL 2003 Another View of IR [Figure: flow diagram with the boxes Information Need, Text Input, Parse, Query, Rank, Collections, Pre-Process, and Index]

39 2003.10.28 - SLIDE 39IS 202 – FALL 2003 Result Sets Run a query, get a result set Two choices –Reformulate query, run on entire collection –Reformulate query, run on result set Example: Dialog query (Redford AND Newman) -> S1 1450 documents (S1 AND Sundance) ->S2 898 documents
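The second choice, running a reformulated query against an earlier result set, can be sketched as keeping named sets and intersecting them. This is an illustration only, not Dialog's actual behavior or syntax; the postings and document ids are hypothetical and far smaller than the counts in the example.

```python
# Hypothetical postings: term -> ids of the documents containing it.
postings = {
    "redford": {1, 2, 3, 4},
    "newman": {2, 3, 4, 5},
    "sundance": {3, 4, 9},
}

sets = {}
# (Redford AND Newman) -> S1
sets["S1"] = postings["redford"] & postings["newman"]
# (S1 AND Sundance) -> S2: the new term is applied to S1, not the collection
sets["S2"] = sets["S1"] & postings["sundance"]

for name, result in sets.items():
    print(name, len(result), "documents")
```

Because S2 is built from S1, it can only shrink; document 9 mentions Sundance but is excluded because it was not in S1.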

40 2003.10.28 - SLIDE 40IS 202 – FALL 2003 Feedback Queries [Figure: flow diagram with the boxes Information Need, Text Input, Parse, Query, Reformulated Query, Rank, Re-Rank, Collections, Pre-Process, and Index]

41 2003.10.28 - SLIDE 41IS 202 – FALL 2003 Ordering of Retrieved Documents Pure Boolean has no ordering In practice: –Order chronologically –Order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one of each term or many of one term? Fancier methods have been investigated –p-norm is most famous Usually impractical to implement Usually hard for user to understand
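The question of one of each term versus many of one term can be seen in a small experiment. The sketch below, using three hypothetical documents and whitespace tokenization, ranks once by total hits and once by the number of distinct query terms matched; neither ordering is claimed to be the better one.

```python
from collections import Counter

# Hypothetical documents, already lowercased.
docs = {
    "d1": "cat cat cat cat collar",
    "d2": "cat dog collar leash",
    "d3": "dog leash leash",
}
query_terms = ["cat", "dog", "collar", "leash"]


def total_hits(text):
    """Total number of occurrences of any query term."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)


def distinct_terms(text):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(1 for t in query_terms if t in words)


by_hits = sorted(docs, key=lambda d: total_hits(docs[d]), reverse=True)
by_distinct = sorted(docs, key=lambda d: distinct_terms(docs[d]), reverse=True)
print(by_hits, by_distinct)
```

Here d1 wins on total hits because it repeats one term, while d2 wins on distinct terms because it contains all four.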

42 2003.10.28 - SLIDE 42IS 202 – FALL 2003 Boolean Advantages –Simple queries are easy to understand –Relatively easy to implement Disadvantages –Difficult to specify what is wanted –Too much returned, or too little –Ordering not well determined Dominant language in commercial systems until the WWW

43 2003.10.28 - SLIDE 43IS 202 – FALL 2003 Faceted Boolean Query Strategy: Break query into facets (polysemous with earlier meaning of facets) –Conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) –Each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)

44 2003.10.28 - SLIDE 44IS 202 – FALL 2003 Faceted Boolean Query Query still fails if one facet missing Alternative: Coordination level ranking –Order results in terms of how many facets (disjuncts) are satisfied –Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: Assign weights to facets
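Coordination level ranking can be sketched by counting, for each document, how many facets have at least one matching term. The example reuses the rain forest facets from the previous slide and assumes each document is a set of index terms, with the phrase "rain forest" indexed as a single term; the documents themselves are hypothetical.

```python
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents, each represented as a set of index terms.
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}


def coordination_level(doc_terms):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in facets if facet & doc_terms)


# Strict faceted Boolean query: every facet must be satisfied.
strict = [d for d in docs if coordination_level(docs[d]) == len(facets)]
# Coordination level ranking: order by how many facets are satisfied.
ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print(strict, ranked)
```

The strict query returns only d1, whereas the ranking still surfaces d2 even though it misses the author facet. Weighting facets would replace the count of 1 per facet with a per-facet weight.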

45 2003.10.28 - SLIDE 45IS 202 – FALL 2003 Proximity Searches Proximity: Terms occur within K positions of one another –pen w/5 paper A “Near” function can be more vague –near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations –“United Nations” “Bill Clinton” Phrase Variants –“retrieval of information” “information retrieval”
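A proximity operator such as pen w/5 paper can be sketched by comparing token positions. This is a minimal illustration, assuming distance is the difference between token positions and that text is split on whitespace; real systems differ in how they count, and the function name `within` is hypothetical.

```python
def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and b lie within k positions.

    With ordered=True, a must come before b.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            distance = j - i
            if ordered and distance <= 0:
                continue  # b does not follow a in this pairing
            if 0 < abs(distance) <= k:
                return True
    return False


tokens1 = "the pen lay on top of the paper".split()
tokens2 = "paper and pen".split()
print(within(tokens1, "pen", "paper", 5))                # 6 positions apart
print(within(tokens2, "pen", "paper", 5))                # 2 positions apart
print(within(tokens2, "pen", "paper", 5, ordered=True))  # wrong order
```

Setting `k=1` with `ordered=True` gives a simple two-word phrase match, for example "united" immediately followed by "nations".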

46 2003.10.28 - SLIDE 46IS 202 – FALL 2003 Filters Filters: Reduce set of candidate docs Often specified simultaneously with the query Usually restrictions on metadata –Restrict by: Date range Internet domain (.edu, .com, .berkeley.edu) Author Size Limit number of documents returned

47 2003.10.28 - SLIDE 47IS 202 – FALL 2003 Boolean Systems Most of the commercial database search systems that pre-date the WWW are based on Boolean search –Dialog, Lexis-Nexis, etc. Most Online Library Catalogs are Boolean systems –E.g., MELVYL Database systems use Boolean logic for searching Many of the search engines sold for intranet search of web sites are Boolean

48 2003.10.28 - SLIDE 48IS 202 – FALL 2003 Why Boolean? Easy to implement Efficient searching across very large databases Easy to explain results –“Has to have all of the words…” (AND) –“Has to have at least one of the words…” (OR)

49 2003.10.28 - SLIDE 49IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic and Boolean IR Systems Text Processing Discussion Credit for some of the slides in this lecture goes to Marti Hearst

50 2003.10.28 - SLIDE 50IS 202 – FALL 2003 Content Analysis Automated Transformation of raw text into a form that represents some aspect(s) of its meaning Including, but not limited to: –Automated Thesaurus Generation –Phrase Detection –Categorization –Clustering –Summarization

51 2003.10.28 - SLIDE 51IS 202 – FALL 2003 Techniques for Content Analysis Statistical –Single Document –Full Collection Linguistic –Syntactic –Semantic –Pragmatic Knowledge-Based (Artificial Intelligence) Hybrid (Combinations)

52 2003.10.28 - SLIDE 52IS 202 – FALL 2003 Text Processing Standard Steps: –Recognize document structure Titles, sections, paragraphs, etc. –Break into tokens Usually space and punctuation delineated Special issues with Asian languages –Stemming/morphological analysis –Store in inverted index (to be discussed later)
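The tokenizing and indexing steps above can be sketched in a few lines. This is a simplified illustration: it assumes lowercase ASCII tokens delimited by spaces and punctuation, skips document structure and stemming entirely, and would not work for languages written without spaces. The function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


docs = {
    1: "The cat wore a collar.",
    2: "A dog, a leash, and a collar!",
}
index = build_index(docs)
print(tokenize("Cats, dogs & collars!"))
print(sorted(index["collar"]))
```

Once the index exists, a Boolean AND of two terms is just the intersection of their two sets.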

53 2003.10.28 - SLIDE 53IS 202 – FALL 2003 Content Analysis Areas How is the text processed? How is the query constructed? [Figure: flow diagram with the boxes Information Need, Text Input, Parse, Query, Rank, Collections, Pre-Process, and Index]

54 2003.10.28 - SLIDE 54 Document Processing Steps From “Modern IR” Textbook

55 2003.10.28 - SLIDE 55IS 202 – FALL 2003 Stemming and Morphological Analysis Goal: “normalize” similar words Morphology (“form” of words) –Inflectional Morphology E.g., inflect verb endings and noun number Never change grammatical class –dog, dogs –tengo, tienes, tiene, tenemos, tienen –Derivational Morphology Derive one word from another, Often change grammatical class –build, building; health, healthy

56 2003.10.28 - SLIDE 56IS 202 – FALL 2003 Automated Methods Powerful multilingual tools exist for morphological analysis –PCKimmo, Xerox Lexical technology –Require a grammar and dictionary –Use “two-level” automata Stemmers: –Very dumb rules work well (for English) –Porter Stemmer: Iteratively remove suffixes –Improvement: Pass results through a lexicon
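To show how far "very dumb rules" can go, and where they fail, here is a toy suffix stripper. It is not the Porter stemmer: the suffix list, the minimum stem length, and the single-pass design are all assumptions made for illustration, and the lexicon is a hypothetical three-word set.

```python
SUFFIXES = ["ing", "ed", "es", "s"]  # hypothetical toy rule list


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_with_lexicon(word, lexicon):
    """Accept the stem only if it is a known word; otherwise keep the input."""
    stem = toy_stem(word)
    return stem if stem in lexicon else word


lexicon = {"dog", "build", "speed"}  # hypothetical tiny lexicon
for w in ["dogs", "building", "speed"]:
    print(w, "->", toy_stem(w), "/", stem_with_lexicon(w, lexicon))
```

The plain rules turn "speed" into "spe", which is the kind of error a lexicon check can catch, since "spe" is not a word. A lexicon cannot catch errors where the wrong stem happens to be a real word.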

57 2003.10.28 - SLIDE 57IS 202 – FALL 2003 Errors Generated by Porter Stemmer From Krovetz ‘93

58 2003.10.28 - SLIDE 58IS 202 – FALL 2003 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic Boolean IR Systems Discussion Credit for some of the slides in this lecture goes to Marti Hearst

59 2003.10.28 - SLIDE 59IS 202 – FALL 2003 Questions from Patrick Riley In Plato's Meno Dialogue, Plato asks: "How does one investigate what one does not know?" Plato's question is similar to typical questions we encounter in this and other readings of INFOSYS 202: how do we overcome the synonymy and polysemy problems faced by lexical searching? Can the LSA (Latent Semantic Analysis) and SVD (singular value decomposition) statistical techniques demonstrated by Dumais et al. solve the lexicon deficiencies in information retrieval?

60 2003.10.28 - SLIDE 60IS 202 – FALL 2003 Paradox The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe –The need to describe that which you do not know in order to find it

61 2003.10.28 - SLIDE 61IS 202 – FALL 2003 Questions from Patrick Riley This paper is from 1988...do you know of any applications or advancements of this LSA approach from the information retrieval community? (Example: AI (LSA passed the TOEFL).) And what are some of the limitations of using this corpus-based text comparison mechanism? (Example: no use of word order, incompleteness?) How does the LSA approach differ from other statistical approaches you've encountered? (Example: Google's "Similar Pages" feature.)

62 2003.10.28 - SLIDE 62IS 202 – FALL 2003 Questions from Joe Hall I would really like to see a show of hands (in class, I can't see you now!) of how many people have heard of either of the terms "Singular-value Decomposition" or "Eigenvector Decomposition" before you sat down to read this article. (I ask because we use this a lot in numerical approximation of radiative transfer in astrophysics... SVD is definitely a litmus test as to whether or not a problem is difficult.)

63 2003.10.28 - SLIDE 63IS 202 – FALL 2003 Questions from Joe Hall I'm going to get picky here. In the Conclusion, Dumais et al. claim, "The latent structure [LSI] approach is useful for helping people find textual information in large collections." However, their results (and those of other researchers!) mostly contradict this claim. So which is it... does the SVD approach "offer no improvement over term matching methods" only for "relatively homogenous" groups of documents like "information science documents." Does LSI work best on widely different documents? Take a look at this paper's abstract which contradicts the Dumais findings: http://tinyurl.com/smfo

64 2003.10.28 - SLIDE 64IS 202 – FALL 2003 Questions from Joe Hall If you raised your hand for the first question, you may know that SVD is very computationally intensive... Dumais claims that "it need only be done once for each dataset." That's no fun... most datasets change over time... not only that, but most datasets grow with time... which means that SVD techniques can only be used on small, static, homogeneous data sets (if you buy the link I showed above)... what fun is that? Where is SVD-enabled LSI useful? Is it merely a fascination of IR researchers and a way to write fancy grant proposals to make the next Maserati payment?

65 2003.10.28 - SLIDE 65IS 202 – FALL 2003 Questions from Tu Tran In what context was this paper written? What was the state of the IR field? Imagine you are an information specialist and had to explain LSI and SVD to your non-mathematically oriented/non-technical manager. How would you do it? The paper did not include any user studies. Can you imagine tasks where users would not find this system useful?

66 2003.10.28 - SLIDE 66IS 202 – FALL 2003 Next Time Statistical Properties of Texts and Vector Representation Readings/Discussion: –Cooper, “Getting Beyond Boole” Dan –Bates, “How to use Controlled Vocabularies More Effectively in Online Searching” Ann –Hearst, “Improving Full-Text Precision on Short Queries Using Simple Constraints” Simon –Modern IR – Chapter 7 Sean

