© Tefko Saracevic, Rutgers University

Web sources and library & information services
Finding, evaluating and using a variety of Web sources for searching and reference

Similarities between Web searching & IR & reference
Basic principles of approach are the same
  – human-human interaction - interview - social, organizational, cognitive, affective aspects to explore including task, need …
  – preparation of search concepts, terms, logic
  – determination of range, restrictions
  – estimation of relevance

Differences
Vastly different sources
  – as to contents, authority, reliability, persistence
  – variation in amounts, depth, breadth
Very different organization
  – little standardization, few if any fields
Quite different search engines & capabilities
  – basic & advanced
  – also different from engine to engine
Differing search strategies needed

Also: invisible Web
Materials that general search engines cannot or WILL not include in their collection of Web pages (indexes)
You cannot find them through general search engines
Contains a vast amount of information
  – much of it authoritative, qualitative

Why do search engines miss?
Size: Web is huge, cannot cover all
Economics: associated costs are high
  – also pay per crawl & rank
Technical: still limited capabilities
Spam: eliminating bad also loses good
Restrictions: some sites do not let crawlers in
Deep structure: some sites complex
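
The "Restrictions" point can be illustrated with the robots exclusion convention (robots.txt), one common way a site tells crawlers to stay out of part of its content. The sketch below is only an illustration: the robots.txt text, the site example.org and the crawler name ExampleBot are all hypothetical and do not come from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: every crawler is
# asked to stay out of the /journals/ area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /journals/
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the hypothetical crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html",
                "http://example.org/journals/issue1.html"):
        print(url, "->", "crawl" if allowed(url) else "skip")
```

Pages under the disallowed path never reach the engine's index, which is one way material ends up in the invisible Web.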

Needed for Web searching
Knowledge & competencies
  – variety of Web sources
  – their organization
  – search engines
  – Web search strategies
  – search dynamics, feedback
Keeping up & up & up
  – constant updates, changes, innovations
  – many domain/subject specific

Web size - who knows?
Estimated over 16 million web servers (Lawrence & Giles, 1999)
  – but only a fraction of direct search relevance
Domains of sites
  – 83% commercial; 6% scientific or educational; 3% health; 2.5% personal; 2% societies; 1.5% government; about 1% each community, religion; 1.5% pornographic
Web Characterization Project - OCLC
  – statistics, trends, report, links …
  – reports 8.5 million web sites for 2001

Organization of sources
No standardization across sources
Major approaches in search engines
  – classification: many directory types used
  – statistical analyses of terms, links
Metatags in sources
  – to enable retrieval by fields
  – HTML “keywords”, “description”: 34% of sites use them
  – Dublin Core: 0.3% of sites use it
Organization: hindrance to retrieval
  – also faked contents to force retrieval
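
To make the metatag point concrete, here is a minimal sketch of how an indexer might pull the HTML “keywords” and “description” fields out of a page. It uses only Python's standard html.parser module; the class name MetaTagParser, the helper extract_meta and the sample page are assumptions made up for illustration, not something described in the slides.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the content of selected <meta name="..."> tags."""

    def __init__(self, wanted=("keywords", "description")):
        super().__init__()
        self.wanted = wanted
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.wanted:
            self.fields[name] = attr.get("content") or ""


def extract_meta(html: str) -> dict:
    """Return a dict of the wanted meta fields found in the page."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page used only for this example.
SAMPLE = """<html><head>
<title>Drug index</title>
<meta name="keywords" content="drugs, pharmacology, dosage">
<meta name="description" content="A searchable index of prescription drugs.">
</head><body>...</body></html>"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE))
```

Because the page author writes these fields, they are also the easiest place to insert faked contents, which is why engines cannot rely on them alone.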

Sources & search engines
Indexed by search engines (publicly indexed)
  – by terms, selection, links, registration
Not publicly indexed
  – many domain sources will not be found, e.g. digital libraries, online journals, reference
  – many commercial sites will hardly be found
Differing approaches to inclusion/selection
  – mostly automatic; also generic source providers
  – increasingly added human evaluation & selection

Search engine coverage
No engine covers more than 16% of WWW
In respect to combined coverage of 11 top:
  – Northern Light 38.3%; Snap 37.1; AltaVista 37.1; HotBot 27.1; MS 20.3; Infoseek 19.2; Google 18.6; Yahoo 17.6; Excite 13.5; Lycos 5.9; EuroSeek 5.2
  – HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have different filtering & Inktomi databases
Northern Light has ‘special collection’ - documents not part of publicly indexable web
Hard to discern & compare coverage
Many national search engines - own coverage
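
The slide mixes two kinds of percentages: the 16% ceiling is a share of the whole Web, while the per-engine figures are shares of what the 11 engines cover together. The sketch below relates the two. It rests on one labeled assumption not stated on the slide: that the 16% ceiling belongs to the engine with the largest relative share (Northern Light). The names TOP_ABSOLUTE, combined and absolute_share are made up for the example.

```python
# Relative coverage (% of the combined coverage of the 11 engines),
# copied from the slide.
RELATIVE = {
    "Northern Light": 38.3, "Snap": 37.1, "AltaVista": 37.1,
    "HotBot": 27.1, "MS": 20.3, "Infoseek": 19.2, "Google": 18.6,
    "Yahoo": 17.6, "Excite": 13.5, "Lycos": 5.9, "EuroSeek": 5.2,
}

# ASSUMPTION: the "no engine covers more than 16%" ceiling is the
# absolute coverage of the top engine in the list above.
TOP_ABSOLUTE = 16.0

# If 38.3% of the combined coverage equals 16% of the Web, the combined
# coverage of all 11 engines is 16 / 0.383 percent of the Web.
combined = TOP_ABSOLUTE / (max(RELATIVE.values()) / 100.0)


def absolute_share(engine: str) -> float:
    """Estimated % of the whole Web covered by one engine."""
    return RELATIVE[engine] / 100.0 * combined


if __name__ == "__main__":
    print(f"combined coverage: about {combined:.1f}% of the Web")
    for name in RELATIVE:
        print(f"{name:15s} about {absolute_share(name):4.1f}%")
```

Under that assumption even the 11 engines together reach well under half of the publicly indexable Web, which is the practical argument for using more than one engine.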

Search features among engines
Some search features are the same across all, but details differ - particularly in advanced
  – Boolean available, but sometimes AND, sometimes OR is the default
  – differences may be found in: phrases, proximity, truncation, case sensitivity, relevance feedback, field searching, special features
  – term expansion to concepts (latent semantic indexing)
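
Why the default operator matters can be shown with a toy index. The following is a minimal sketch, assuming a three-document collection invented for the example; build_index and search are hypothetical helper names, and real engines add ranking, stemming and much more.

```python
# Tiny made-up collection: document id -> text.
DOCS = {
    1: "digital library reference services",
    2: "library catalog search",
    3: "digital camera reviews",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, default_op="AND"):
    """Combine the query terms with the engine's default operator."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if default_op == "AND":
        return set.intersection(*postings)
    if default_op == "OR":
        return set.union(*postings)
    raise ValueError("default_op must be 'AND' or 'OR'")


if __name__ == "__main__":
    index = build_index(DOCS)
    print("AND default:", sorted(search(index, "digital library", "AND")))
    print("OR default: ", sorted(search(index, "digital library", "OR")))
```

The same two-word query returns one document under an AND default and all three under an OR default, so a searcher who does not know the default cannot judge the result set.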

Search strategies & outputs
Geared toward very short searches
  – big majority of searches 2-3 terms (av. 2.5); in IR av. … making a big difference
Directory browsing a big component - not in IR
Geared toward limited top outputs
Ranking output by relevance predominates
  – relevance calculations differ & are proprietary (secret)
  – except Google - they published their method
  – affects search strategy - you guess how it is done
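
Google's published method is link-based ranking (PageRank). The sketch below is a simplified illustration of that idea on a four-page graph, not the engine's actual implementation: the link graph, the damping value 0.85 and the iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass rank along outgoing links in equal parts.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    ranks = pagerank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 3))
```

Page C, which receives the most links, ends up ranked first, and page D, which nobody links to, ends up last. Knowing this much about a ranking method lets a searcher reason about why certain pages surface first.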

Meta search engines
Search engines that cover search engines
  – many around, e.g.
  – All4one: four windows - good for comparison
  – CDNET Search.com: meta engine of meta engines - customization
Search Engines Worldwide
  – 174 countries, over 1300 engines
More on the horizon & differing
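
One core task of a meta search engine is merging the ranked lists that come back from the engines it covers. The sketch below shows one simple way to do that, summing reciprocal ranks so that URLs returned by several engines rise to the top. The scoring rule, the engine names and the result lists are all assumptions made for illustration; the slides do not say how any particular meta engine merges results.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one list.

    Each URL scores 1/position in every list where it appears; ties are
    broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        for position, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / position
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two hypothetical engines.
RESULTS = {
    "engine_a": ["http://example.org/x", "http://example.org/y"],
    "engine_b": ["http://example.org/y", "http://example.org/z"],
}

if __name__ == "__main__":
    for url in merge_results(RESULTS):
        print(url)
```

Here the page found by both engines comes first, ahead of a page that only one engine ranked at the top.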

Specialized meta engines
Selective with directories & large number of databases & search engines
  – Complete Planet
  – Invisible Web
U.S. federal information via Government Printing Office Access
  – Federal Bulletin Board (file libraries for download from many agencies)

Reference (expert) services
Reference services - several models
  – Q&A, directories, answers etc., e.g.
  – Martindale’s Reference Desk - comprehensive
  – Ask Jeeves! - most popular
  – Ask ERIC - education questions & answers
  – Information Please - almanac type questions
Academic libraries developing reference models - new service area

Libraries as Web sources
Academic libraries providing open collections & services; models vary
  – Rutgers libraries - big long term effort
  – various sources & links involved; for domain information & sources go to:
  – Electronic Reference Sources; Subject Research Guides: Social Sciences & Law; Library & Information Science
  – University of California, Berkeley - a most elaborate effort together with Sun Corporation

Virtual libraries on the Web
Libraries emerging only on the Web
  – more & more libraries & organizations involved
Examples of academic & public libraries
  – Virtual Library - Switzerland, US, UK & other countries - ‘oldest virtual library on the Web’
  – Toronto Public Library
  – Internet Public Library, Michigan

Domain sites
Many domain/issue specific sites
  – rich & often unique coverage & services
  – different approaches & requirements
Examples in health related domains:
  – Medscape - registration required
  – Rxlist - The Internet Drug Index
  – Mayo Clinic HealthOasis

Societies, organizations, publishers
Great many rich sources for searching
  – differences in requirements, depth, richness
Examples from a variety of organizations:
  – Assoc. for Computing Machinery: Digital Library; subscription or registration
  – State Department: about the U.S. & other countries
  – R.R. Bowker: Free Resources from Bowker; Library Resource Guide
  – Genealogy

Language barriers on the Web
English still the major language
  – but declining, now slightly over 50%
Multilingual retrieval search engines
  – Euroseek - searches 40 languages
  – All the Web - 45 languages
  – in both, search in different languages covers primarily their language sources

Language barriers: translations
A number of translation sites
  – machine aided - i.e. plug in terms, phrases, sentences in one & review in the other language, but effectiveness???
  – Free Translations
  – Babel Fish
  – Travlang - great for travelers - phrases

Key professional competencies
Knowledge of SOURCES in area of interest
  – search engines not enough; not too helpful in finding these other sources; structure hard to discern
Evaluation of sources - a key professional skill!
  – standard criteria: quality, veracity, coverage etc.
  – plus Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability

Competencies …
Knowledge of users & use
Knowledge of searching
Use of technology
Adaptability, flexibility
Integration with other resources
Teaching others
Constant learning & update