8/28/2001Information Organization and Retrieval SIMS 202 Information Organization and Retrieval Prof. Ray Larson & Prof. Warren Sack UC Berkeley SIMS Tues/Thurs.


1 8/28/2001 Information Organization and Retrieval
SIMS 202 Information Organization and Retrieval
Prof. Ray Larson & Prof. Warren Sack
UC Berkeley SIMS
Tues/Thurs 10:30 am-12:00 pm, Fall 2001
Lecture authors: Marti Hearst & Ray Larson

2 Today
Introductions
Course Overview
Administrivia

3 Goals of the Course
Learn about:
– Design, development and use of information storage and retrieval systems
– Practical and theoretical foundations of information organization and analysis
– Evaluation of information access systems
– Cognitive and user-centric considerations
– Hands-on experience with information systems

4 Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process

5 Web Search Questions
What do people search for?
How do people use search engines?
– How often do people find what they are looking for?
– How difficult is it for people to find what they are looking for?
How can search engines be improved?

6 What Do People Search for on the Web?
Study by Spink et al., Oct 98
– www.shef.ac.uk/~is/publications/infres/paper53.html
– Survey on Excite, 13 questions
– Data for 316 surveys

7 What Do People Search for on the Web?
Topics:
Genealogy/Public Figure: 12%
Computer related: 12%
Business: 12%
Entertainment: 8%
Medical: 8%
Politics & Government: 7%
News: 7%
Hobbies: 6%
General info/surfing: 6%
Science: 6%
Travel: 5%
Arts/education/shopping/images: 14%
Something is missing…

8 What do people search for on the web?
50,000 queries from Excite, 1997
Most frequent terms:
4660 sex
3129 yahoo
2191 internal site admin check from kho
1520 chat
1498 porn
1315 horoscopes
1284 pokemon
1283 SiteScope test
1223 hotmail
1163 games
1151 mp3
1140 weather
1127 www.yahoo.com
1110 maps
1036 yahoo.com
983 ebay
980 recipes

9 Why do these differ?
Self-reporting survey
The nature of language
– Only a few ways to say certain things
– Many different ways to express most concepts
  UFO, Flying Saucer, Space Ship, Satellite
  How many ways are there to talk about history?

10 Intranet Queries (Aug 2000)
3351 bearfacts
3349 telebears
1909 extension
1874 schedule+of+classes
1780 bearlink
1737 bear+facts
1468 decal
1443 infobears
1227 calendar
989 career+center
974 campus+map
920 academic+calendar
840 map
773 bookstore
741 class+pass
738 housing
721 tele-bears
716 directory
667 schedule
627 recipes
602 transcripts
582 tuition
577 seti
563 registrar
550 info+bears
543 class+schedule
470 financial+aid

11 Intranet Queries
Summary of sample data from 3 weeks of UCB queries
– 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
– 6.7% Schedule of classes or final exams (6222)
– 5.4% Summer Session (5041)
– 3.2% Extension (2932)
– 3.1% Academic Calendar (2846)
– 2.4% Directories (2202)
– 1.7% Career Center (1588)
– 1.7% Housing (1583)
– 1.5% Map (1393)
Average query length over last 4 months: 1.8 words
This suggests what is difficult to find from the home page
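As a rough illustration of how statistics like these could be derived from a query log, the sketch below counts query frequencies and the mean query length. The log contents and the function name `summarize` are hypothetical, and treating "+" as a word separator is an assumption based on how the queries above are written; the slides do not describe the actual analysis procedure.

```python
from collections import Counter

def summarize(queries):
    """Return (query frequency counts, mean query length in words)."""
    counts = Counter(queries)
    # Assumption: "+" separates words in the logged query strings.
    lengths = [len(q.replace("+", " ").split()) for q in queries]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return counts, mean_length

# Hypothetical five-line log, not the real UCB data.
log = ["bearfacts", "bear+facts", "schedule+of+classes", "map", "bearfacts"]
counts, mean_length = summarize(log)
```

Note that spelling variants such as "bearfacts" and "bear+facts" are counted as separate queries here; grouping them, as the summary slide does, would need an extra normalization step.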

12 An Example Search System: Cha-Cha
A system for searching complex intranets
Places retrieval results in context

13 An Example Search System: Cha-Cha
Important design goals:
– Users at any level of computer expertise
– Browsers at any version level
– Computers of any speed


16 Search: Where to Start?
Guess words?
– Search engine plunges you into the middle of a site/collection
– Too many or too few results
– No context
Use a directory?
– If large, may be difficult/frustrating to navigate
– Several ways to organize the information
– May not reflect users’ needs
Solution: Integrate Browsing and Search


19 How Cha-Cha Works
Crawl entire Intranet
Compute the shortest hyperlink path from a certain root page to every web page
Index and compute metadata for the pages
– Using Cheshire II
– Run a user query.
– Gather all the hits
– Create a “directory” based on combining the shortest paths
– Special graph algorithm removes redundant links and internal nodes
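The slides do not say how Cha-Cha computes shortest paths or merges them. As a minimal sketch, breadth-first search is one standard way to find the shortest hyperlink path from a root page, and the paths to the hit pages can then be merged into a nested "directory". The link graph, page names, and function names below are all hypothetical, and the "special graph algorithm" that prunes redundant links and internal nodes is not reproduced.

```python
from collections import deque

def shortest_paths(links, root):
    """Breadth-first search: shortest hyperlink path from root to each reachable page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in paths:  # the first visit in BFS is a shortest path
                paths[target] = paths[page] + [target]
                queue.append(target)
    return paths

def build_directory(paths, hits):
    """Merge the shortest paths of the hit pages into one nested dict (a tree)."""
    tree = {}
    for hit in hits:
        node = tree
        for page in paths[hit]:
            node = node.setdefault(page, {})
    return tree

# Hypothetical toy intranet: page -> pages it links to.
links = {
    "home": ["admissions", "departments"],
    "admissions": ["tuition"],
    "departments": ["sims", "law"],
    "sims": ["faculty", "tuition"],
    "law": ["faculty"],
}

paths = shortest_paths(links, "home")
directory = build_directory(paths, hits=["tuition", "faculty"])
```

Pages reachable by several routes (here "tuition" and "faculty") appear once, under the first shortest path found; a hit that is unreachable from the root would raise a `KeyError` in this sketch.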

20 Cha-Cha System Architecture (diagram: crawl the web; store the documents)

21 Cha-Cha System Architecture (diagram: crawl the web; store the documents; create files of metadata; Cheshire II)

22 Cha-Cha Metadata
Information about web pages
– Title
– Length
– Inlinks
– Outlinks
– Shortest Paths from a root home page
Used to provide innovative search interface

23 Cha-Cha System Architecture (diagram: crawl the web; store the documents; create files of metadata; Cheshire II)

24 Cha-Cha System Architecture (diagram: crawl the web; create a keyword index; store the documents; create files of metadata; Cheshire II)

25 Creating a Keyword Index
For each document
– Tokenize the document
  Break it up into tokens: words, stems, punctuation
  There are many variations on this
– Record which tokens occurred in this document
The result is called an Inverted Index
– Dictionary: a record of all the tokens in the collection and their overall frequency
– Postings File: a list recording, for each token, which documents it occurs in and how often it occurs in each
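The indexing steps above can be sketched in a few lines. This is an illustration only, not Cheshire II's implementation: the tokenizer (lowercase, split on runs of letters and digits, no stemming) is just one of the "many variations", and the three documents and the names `tokenize` and `build_index` are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build the two parts of an inverted index from {doc_id: text}."""
    dictionary = Counter()        # token -> total occurrences in the collection
    postings = defaultdict(dict)  # token -> {doc_id: occurrences in that document}
    for doc_id, text in docs.items():
        for token, count in Counter(tokenize(text)).items():
            dictionary[token] += count
            postings[token][doc_id] = count
    return dictionary, postings

# Hypothetical three-document collection.
docs = {
    "d1": "Pam Samuelson teaches at SIMS.",
    "d2": "SIMS faculty: Samuelson, Larson, Hearst.",
    "d3": "Pam's office hours at SIMS",
}
dictionary, postings = build_index(docs)
```

With this tokenizer "Pam's" splits into "pam" and "s", which shows why tokenization choices matter: a different scheme would produce a different dictionary.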

26 Cha-Cha System Architecture (diagram: Cheshire II; user query)

27 Responding to the User Query
User searches on “pam samuelson”
Search Engine looks up documents indexed with one or both terms in its inverted index
Search Engine looks up titles and shortest paths in the metadata index
User Interface combines the information and presents the results as HTML
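A minimal sketch of this lookup, assuming a postings table and a metadata table already exist. Both tables, the document ids, the titles, and the function name `search` are hypothetical. Ordering hits by the number of matching query terms is only a placeholder, since the slides leave ranking unexplained, and the HTML rendering step is omitted.

```python
# Hypothetical postings: token -> {doc_id: occurrences in that document}.
postings = {
    "pam": {"d1": 1, "d3": 1},
    "samuelson": {"d1": 1, "d2": 1},
}

# Hypothetical metadata index: doc_id -> (title, shortest path from the root page).
metadata = {
    "d1": ("Pam Samuelson's home page", ["home", "sims", "faculty"]),
    "d2": ("SIMS faculty list", ["home", "sims"]),
    "d3": ("Office hours", ["home", "sims", "courses"]),
}

def search(query, postings, metadata):
    """Return documents containing one or more query terms, with title and path."""
    matched = {}  # doc_id -> number of distinct query terms it contains
    for term in set(query.lower().split()):
        for doc_id in postings.get(term, {}):
            matched[doc_id] = matched.get(doc_id, 0) + 1
    # Placeholder ordering: most matching terms first, then doc_id for stability.
    ordered = sorted(matched, key=lambda d: (-matched[d], d))
    return [
        {"doc": d, "title": metadata[d][0], "path": " > ".join(metadata[d][1])}
        for d in ordered
    ]

results = search("pam samuelson", postings, metadata)
```

The document containing both terms comes first here, followed by those containing only one; a real system would replace the placeholder ordering with a ranking function.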

28 Cha-Cha System Architecture (diagram: Cheshire II; user query)

29 Cha-Cha System Architecture (diagram: Cheshire II; server accesses the databases)

30 Cha-Cha System Architecture (diagram: Cheshire II; results shown to user)

31 Cha-Cha System Architecture (diagram: Cheshire II; results shown to user; server accesses the databases; user query)

32 What hasn’t been explained here?
How documents are ranked
How queries are formed
How shortest paths are computed
How the system is built
– … among other things!
– This is just an introduction! Much more later.

33 Two Main Themes
Information Organization and Design
Information Retrieval and the Search Process

34 (Approximate) Course Schedule
Retrieval
– The Search Process
– Content Analysis
  Tokenization, Zipf’s Law, Lexical Associations
– IR Implementation
– Term weighting and document ranking
  Vector space model
  Probabilistic model
– User Interfaces
  Overviews, query specification, providing context, relevance feedback

35 Overview Example
Web site design/Information Architecture
– Incorporates many of the organizational issues we will be covering
– Example taken from a study of professional designers, by Mark Newman

36 Information Architecture and Web Site Design
Information design
– structure, categories of information
Navigation design
– interaction with information structure
Graphic design
– visual presentation of information and navigation (color, typography, etc.)

37 Design Specialties
Information Architecture
– includes management and more responsibility for content
User Interface Design
– includes testing and evaluation

38 Web Site Design Process (diagram labels: Implementation Design Preliminary Design Conceptualization Needs Assessment)

39 Design Process: Preliminary Design (information/navigation design: schematic)

40 Design Process: Preliminary Design (navigation design: storyboard)

41 Web Site Design Process
Major design activities are:
– Deciding on a set of categories that define the information content
– Deciding how to represent these
– Deciding on the navigation structure through the categorized content
Example: a movie listing website
There are similarities and differences to:
– Database design
– Thesaurus design

42 (Approximate) Course Schedule
Organization
– Overview
– Metadata and Markup
– Controlled Vocabularies, Classification, Thesauri
– Information Design
  Thesaurus Design
  Database Design

43 Assignments and Exams
Approximately 9 short assignments (due within one week to ten days)
– Sometimes “checked”, sometimes graded
One Midterm exam (take-home, open book)
Final exam (during Finals week)
Grading:
– Assignments: 40% (not evenly weighted)
– Final: 25%
– Midterm: 25%
– Class Participation: 10%

44 Readings
Course Reader
– Will be available in about a week (will announce)
Textbooks
– Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
– The Organization of Information, Arlene G. Taylor, Libraries Unlimited, 1999

45 Homework (!)
Read the handout (Borges and Dennett)
Write one or two paragraphs on
– What is information, according to your background or area of expertise?
Due in class this Thursday, Aug 30.

46 What is Information?
There is no “correct” definition
Can involve philosophy, psychology, signal processing, physics
Cookie Monster’s definition:
– “news or facts about something”
Oxford English Dictionary
– information: informing, telling; thing told, knowledge, items of knowledge, news
– knowledge: knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known

47 Next Time
More on What is Information?
And How much of it is out there?

