1
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
2
Web Search Engines How does web search differ from: –Lexis-nexis? –Bibliographic search (melvyl)?
3
How do Web Search Engines Differ? Different kinds of information –Unedited – anyone can enter »Quality issues »Spam –Varied information types »Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place! –Sources are not differentiated »Search over medical text is the same as search over product catalogs
4
Directories vs. Search Engines An IMPORTANT Distinction • Directories –Hand-selected sites –Search over the contents of the descriptions of the pages –Organized in advance into categories • Search Engines –All pages in all sites –Search over the contents of the pages themselves –Organized after the query by relevance rankings or other scores
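The distinction above can be sketched in a few lines: the same "all words" search runs over two different collections, one built from hand-written site descriptions (directory) and one built from the page text itself (search engine). This is only an illustrative sketch. The site names, descriptions, page texts, and the function names `build_index` and `search` are all assumptions made for the example, not anything taken from a real directory or engine.

```python
def build_index(texts):
    """Map each lowercased word to the set of ids whose text contains it."""
    index = {}
    for doc_id, text in texts.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query):
    """Return the ids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical directory: one hand-selected site with an editor's description.
descriptions = {"mazda.example": "official mazda cars site"}

# Hypothetical crawl: the full text of every page that was found.
pages = {
    "mazda.example": "mazda cars trucks transmission repair manuals",
    "fan.example": "my mazda transmission rebuild diary",
}

directory_index = build_index(descriptions)
engine_index = build_index(pages)
```

With these toy inputs, `search(directory_index, "mazda transmission")` finds nothing because the description never mentions transmissions, while the same query against `engine_index` matches both pages.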
5
How does web search differ? Different kinds of users –Lexis-nexis: »professional searchers »paying (by the query or by the minute) –Online catalogs (melvyl) »scholars searching scholarly literature –Web »Every type of person with every type of goal »No “driving school” for searching
6
How do Web Search Engines Differ? Different kinds of information needs: what does the user want to know? »Example: Search on “Mazda” • What does this mean on the web? • What does this mean on lexis-nexis? »Example: “Mazda transmissions” »Example: “Manufacture of Mazda transmissions in the post-cold war world”
7
What Do People Search for on the Web? (from Spink et al. 98 study) Topics »Genealogy/Public Figure:12% »Computer related:12% »Business:12% »Entertainment: 8% »Medical: 8% »Politics & Government 7% »News 7% »Hobbies 6% »General info/surfing 6% »Science 6% »Travel 5% »Arts/education/shopping/images 14%
8
Web Search Queries • Web search queries are SHORT –~2.4 words on average (Aug 2000) –Has increased, was 1.7 (~1997) • User Expectations –Many say “the first item shown should be what I want to see”! –This works if the user has the most popular/common notion in mind
9
Recent statistics from Inktomi, August 2000, for one client, one week
Total # queries: 1315040
Number of repeated queries: 771085
Number of queries with repeated words: 12301
Average words/query: 2.39
Query type: All words: 0.3036; Any words: 0.6886; Some words: 0.0078
Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT)
Phrase searches: 0.198
URL searches: 0.066
URL searches w/http: 0.000
Email searches: 0.001
Wildcards: 0.0011 (0.7042 '?'s)
Frac '?' at end of query: 0.6753
Interrogatives when '?' at end: 0.8456, composed of:
who: 0.0783; what: 0.2835; when: 0.0139; why: 0.0052; how: 0.2174; where: 0.1826; where-MIS: 0.0000; can, etc.: 0.0139; do(es)/did: 0.0
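To put the raw counts on the same scale as the fractions, a quick calculation helps. The sketch below only does arithmetic on the numbers reported on this slide. The variable names are made up for the example, and it assumes the listed fractions are fractions of the total query count, which the slide does not state explicitly.

```python
# Figures copied from the slide (one client, one week, August 2000).
total_queries = 1315040
repeated_queries = 771085
boolean_fraction = 0.0015

# Share of all queries that were counted as repeated.
repeated_share = repeated_queries / total_queries

# Approximate number of Boolean queries, ASSUMING the fraction is
# relative to the total number of queries.
approx_boolean_queries = round(boolean_fraction * total_queries)

print(f"repeated share: {repeated_share:.3f}")
print(f"approx. Boolean queries: {approx_boolean_queries}")
```

Under that assumption, more than half of the week's queries were repeats, while explicit Boolean queries number only about two thousand out of 1.3 million.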
10
How to Optimize for Short Queries? • Find good starting places –User still has to search at the site itself • Dialogues –Build upon a series of short queries –Not well understood how to do this for the general case • Question Answering –AskJeeves – hand edited –Automated approaches are under development »Very simple »Or domain-specific
11
How to Find Good Starting Points? • Manually compiled lists –Directories –e.g., Yahoo, Looksmart, Open Directory • Page “popularity” –Frequently visited pages (in general) –Frequently visited pages as a result of a query • Link “co-citation” –Which sites are linked to by other sites? • Number of pages in the site –Not currently used (as far as I know)
12
Link Analysis for Starting Points • Assumptions: –If the pages pointing to this page are good, then this is also a good page. –The words on the links pointing to this page are useful indicators of what this page is about. –References: Page et al. 98, Kleinberg 98
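The first assumption ("good pages point to good pages") can be sketched as an iterative score propagation over the link graph. This is a simplified illustration in the spirit of the cited work, not a reproduction of either paper's algorithm. The toy graph, the damping value of 0.85, the iteration count, and the function name `link_scores` are all assumptions chosen for the example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Iteratively score pages so that links from high-scoring pages count more.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of its links.
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for other in pages:
                    new_score[other] += share
        score = new_score
    return score


# Hypothetical graph: three fan pages link to an official site,
# which links back to one of them.
toy_links = {
    "fan1": ["official"],
    "fan2": ["official"],
    "fan3": ["official"],
    "official": ["fan1"],
}
toy_scores = link_scores(toy_links)
```

In this toy graph the page named `official` ends up with the highest score, and `fan1` scores above the other fan pages because it is the only one the high-scoring official page links to.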
13
Link Analysis for Starting Points • Why does this work? –The official Toyota site will be linked to by lots of other official (or high-quality) sites –The best Toyota fan-club site probably also has many links pointing to it –Less high-quality sites do not have as many high-quality sites linking to them
14
Link Analysis for Starting Points • Does this really work? –Actually, there have been no rigorous evaluations –Seems to work for the primary sites; not clear if it works for the relevant secondary sites –One (small) study suggests that sites with many pages are often the same as those with good link co-citation scores. (Terveen & Hill, SIGIR 2000)
15
What is Really Being Used? • Today's search engines combine these methods in various ways –Integration of Directories »Today most web search engines integrate categories into the results listings »Lycos, MSN, Google –Link analysis »Google uses it; others are using it or will soon »Words on the links seem to be especially useful –Page popularity »Many use DirectHit’s popularity rankings
16
What about Ranking? • Lots of variation here –Pretty messy in many cases –Details usually proprietary and fluctuating • Combining subsets of: –Term frequencies –Term proximities –Term position (title, top of page, etc) –Term characteristics (boldface, capitalized, etc) –Link analysis information –Category information –Popularity information
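One simple way to picture "combining subsets of" these signals is a weighted sum. Since the slide notes that real details are proprietary and fluctuating, everything specific here is invented for illustration: the weights, the signal values (assumed to be pre-scaled to the range 0 to 1), the page names, and the function names `combined_score` and `rank`.

```python
# Hypothetical weights, one per signal type listed on the slide.
WEIGHTS = {
    "term_frequency": 0.30,
    "term_proximity": 0.20,
    "term_position": 0.15,
    "term_characteristics": 0.05,
    "link_analysis": 0.15,
    "category": 0.05,
    "popularity": 0.10,
}


def combined_score(signals, weights=WEIGHTS):
    """Weighted sum of whichever signals are available; missing ones count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank(pages, weights=WEIGHTS):
    """Return page ids ordered from highest to lowest combined score."""
    return sorted(pages, key=lambda page: combined_score(pages[page], weights),
                  reverse=True)


# Hypothetical signal values for two pages matching the same query.
example_pages = {
    "official": {"term_frequency": 0.6, "term_position": 1.0,
                 "link_analysis": 0.9, "popularity": 0.8},
    "stuffed": {"term_frequency": 1.0, "term_characteristics": 1.0},
}
ranking = rank(example_pages)
```

With these made-up numbers the page that is strong on several signals outranks the page that maximizes term frequency alone, which is one reason to combine signals rather than rely on a single one.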
17
High-Precision Ranking Proximity search can help get high-precision results when the query has more than one term –Hearst ’96 paper: »Combine Boolean and passage-level proximity »Shows significant improvements when retrieving top 5, 10, 20, 30 documents »Results reproduced by Mitra et al. 98 »Google uses something similar
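The general idea of pairing a Boolean constraint with proximity can be sketched as follows: keep only documents that contain every query term, then prefer documents where the terms occur close together. This is a rough illustration of the idea, not the method from the Hearst ’96 paper. Whitespace tokenization, the smallest-window measure, the example documents, and the function names are all assumptions.

```python
def smallest_window(tokens, terms):
    """Length in tokens of the shortest span containing every term, or None."""
    terms = set(terms)
    # Positions of query terms only, in document order.
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    best = None
    counts = {}
    left = 0
    for right_pos, right_term in hits:
        counts[right_term] = counts.get(right_term, 0) + 1
        # While the current span covers all terms, try shrinking it from the left.
        while len(counts) == len(terms):
            left_pos, left_term = hits[left]
            width = right_pos - left_pos + 1
            if best is None or width < best:
                best = width
            counts[left_term] -= 1
            if counts[left_term] == 0:
                del counts[left_term]
            left += 1
    return best


def proximity_rank(docs, query):
    """Boolean AND filter, then order by how tightly the terms cluster."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        window = smallest_window(text.lower().split(), terms)
        if window is not None:  # the document contains every query term
            scored.append((window, doc_id))
    return [doc_id for window, doc_id in sorted(scored)]


# Hypothetical documents.
example_docs = {
    "d1": "mazda transmissions are built in japan",
    "d2": "mazda sells cars and trucks and also some transmissions",
    "d3": "mazda cars",
}
proximity_result = proximity_rank(example_docs, "Mazda transmissions")
```

Here `d3` is dropped because it lacks one of the terms, and `d1` is ranked ahead of `d2` because its two query terms are adjacent.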
18
Boolean Formulations, Hearst 96 Results
19
Web Spam • Email Spam: –Undesired content • Web Spam: –Content disguised as something it is not in order to »Be retrieved more often than it otherwise would »Be retrieved in contexts that it otherwise would not be retrieved in
20
Web Spam • What are the types of Web spam? –Add extra terms to get a higher ranking »Repeat “cars” thousands of times –Add irrelevant terms to get more hits »Put a dictionary in the comments field »Put extra terms in the same color as the background of the web page –Add irrelevant terms to get different types of hits »Put “sex” in the title field in sites that are selling cars –Add irrelevant links to boost your link analysis ranking • There is a constant “arms race” between web search companies and spammers
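The first spam type (repeating a term thousands of times) suggests a very simple counter-measure: check how much of a page is taken up by its single most frequent word. The sketch below is only a toy heuristic to make the "arms race" concrete. The 0.3 threshold, whitespace tokenization, and the function names are assumptions, and nothing here describes what any real search engine does.

```python
from collections import Counter


def top_term_share(text):
    """Fraction of the page's tokens taken up by its most frequent token."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens)


def looks_stuffed(text, threshold=0.3):
    """Flag a page when one term dominates it (the threshold is an arbitrary guess)."""
    return top_term_share(text) > threshold


# Hypothetical pages.
stuffed_page = "cars " * 1000 + "for sale"
normal_page = "we sell new and used cars in berkeley"
```

A spammer can defeat this particular check by mixing in filler words, which is why such heuristics keep having to change.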
21
Commercial Issues • General internet search is often commercially driven –Commercial sector sometimes hides things – harder to track than research –On the other hand, most CTOs for search engine companies used to be researchers, and so help us out –Commercial search engine information changes monthly –Sometimes motivations are commercial rather than technical »Goto.com uses payments to determine ranking order