1 A.Frank 1 Digital Libraries (DL): Awareness and Discovery Ariel Frank Dept. of Computer Science Bar-Ilan University Joint research with Nir Yom Tov, Alon Kadury & Elina Masevich

2 2 A.Frank Presentation motivation Ad hoc and unsound use of Search Engines (SEs) does not help retrieve quality information on the Web. Digital Libraries (DLs), on the other hand, provide high-quality retrieval of authoritative results, especially for exploratory search. However, awareness and discovery of DLs on the Web are still lacking. So what can be done about it?

3 3 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

4 4 A.Frank Google/SE Awareness

5 5 A.Frank So how to overcome Googlism?!

6 6 A.Frank Often heard sayings “What – is there something to search with besides search engines?” “Sure I know all about search engines – I always use google.” “Sure I know all about directories – I always use yahoo!” “Sorry, never heard about digital libraries.” “Listen, I’m used to classical libraries.” “I can find only E-books in a digital library, no?”

7 7 A.Frank Digital Library Vision?!

8 8 A.Frank Sample list of Digital Libraries
LOC - Library of Congress American Memory (http://memory.loc.gov/ammem/)
NSDL - National Science DL (http://nsdl.org)
IPL - Internet Public Library (http://www.ipl.org)
CDL - California DL (http://www.cdlib.org)
ADL - Alexandria DL (http://www.alexandria.ucsb.edu)
BL - British Library (http://www.bl.uk/)
NZDL - New Zealand DL (http://www.nzdl.org/)
Einstein Archives Online (http://www.alberteinstein.info/)

9 9 A.Frank Search Engines: Which kind to use? The right one. [Diagram labels: Web; Search Engine; Index (General, Specialty); Directory (General, Specialty); Meta-Search Engine]

10 10 A.Frank When not to use SEs? You know it all. You prefer asking friends (or paid experts ). You know the Web site for it (and didn’t forget the exact URL or have auto-completion or bookmark or can access through another known site). You already found a specific/relevant digital library or database (maybe in Invisible Web). Tired of paid inclusions, SE spamming, and sponsored commercial results. Tired of chasing down useless URLs.

11 11 A.Frank When to use an Index? Need to search for a narrow piece of information. Have a specific objective/site in mind. Want to find/rank many related Web sites. Want to factor quantity in (index has crawler based results). Need to check/fix spelling (based on Web statistics).

12 12 A.Frank When to use a Directory? Clear about the exact topic of your query. Need general information on a rather broad topic/category. Want to amass knowledge on a fairly wide subject. Would like to browse (and then search) a certain area. Want to factor quality in (directory has human-powered results), not quantity. Need information that is usually carefully evaluated and even annotated.

13 13 A.Frank When to use a Meta-SE? When single Basic-SE fails to provide good results. One-stop shopping - prefer to search multiple SEs/sites at once to get blended ranked results (so as to save effort/time). When the query is simple (complex fields/options don't usually work). Searching for multi-faceted topics. Want to get clustered results to focus search on the relevant keywords. Looking for current events/news.

14 14 A.Frank When to use a Specialty-SE? When general-SE fails to provide good results. When your target is very topic/technology specific. Want to find more than just Web pages/sites. Need more results from the Invisible Web. Want your search terms to more likely have the meanings you intended them to have.

15 15 A.Frank SE Quantity vs. DL Quality? SE DL

16 16 A.Frank SE vs. DL Potential Coverage Resources Relevant SE DL

17 17 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

18 18 A.Frank Classical (Analogical) Library

19 19 A.Frank So What is a Digital Library? There are scores of definitions. Most are very general and verbose. A managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network. Arms, William Y., Digital Libraries, MIT Press, Cambridge, 2000.

20 20 A.Frank Definition - A Digital Library is: 1. Collection of digital objects 2. Collection of knowledge structures 3. Collection of library services 4. Library Categories: Domain, Focus & Topic 5. Quality Control 6. Preservation/Persistence

21 21 A.Frank 1. Collection of Digital Objects Documents (e.g., texts, HTML pages) Books Journals Multimedia (images, audio, video, etc.) Charts/Maps. Data objects are available directly or indirectly.

22 22 A.Frank 2. Collection of Knowledge Structures Metadata: Standards, Markup Indices, Catalogs, Guides Taxonomies, Ontologies, Thesauri Dictionaries, Glossaries, Concordances Gazetteers Abstracts/Summaries

23 23 A.Frank 3. Collection of Library Services Management (computerization, communication) Collections development Search (query formulation) and Browse interfaces Multi-access/use for varied users Online Help, Reference, Consultation Logging, statistics and Performance Measurement Evaluation (PME) SDI: Selective Dissemination of Information (Push mode)

24 24 A.Frank 4. Library Categories: Domain, Focus & Topic Domain: belongs to an area (DNS TLDs). –edu, com, org, gov, us, il, ac.il, co.il, … Focus: created to serve a certain community of users/patrons. –Academic, Public, National, School, … Topic: the subject of the collection; can be relatively finely-grained. –Law, Medicine, Music, Web, …

25 25 A.Frank 5. Quality Control Selection criteria. All material is assessed and authorized (“certified”). Adhere to licensing and copyrights. Use of Digital Rights Management (DRM). Integrity enforced (proven quality). Use of filtering. Support for profiling/stereotyping.

26 26 A.Frank 6. Preservation/Persistence Access and usage is long term Serves as an archive Scanning and digitization Quality reproduction of material Material persistency –paper vs. digital media –digital formats (software tools)

27 27 A.Frank Need for a delicate balance

28 28 A.Frank Basic SE (BSE) Meta SE (MSE) Popularity SE (PSE) Stand-alone DL (SDL) Harvested DL (HDL) Federated DL (FDL) Digital Library (DL) Search Engine (SE) Directory (Catalog, Guide, Subject Gateway) Web Repositories Hierarchy

29 29 A.Frank Types of DLs Stand-alone Digital Library (SDL) – also self-contained, several collections Federated Digital Library (FDL) – also confederated, networked Harvested Digital Library (HDL) – also distributed

30 30 A.Frank Stand-alone Digital Library (SDL) The regular (classical) DL. Implemented locally in a fully computerized fashion, with networked access. Self-contained material: – edited/generated – scanned/digitized – purchased Single or Several digital collections.

31 31 A.Frank Federated Digital Library (FDL) Contains several autonomous libraries. Based on common focus and topic. Usually heterogeneous repositories. Connected via a network. Forms a flat unified library. Transparent user interface. The major problem is interoperability.

32 32 A.Frank Harvested Digital Library (HDL) Virtual library providing metadata-based access to relevant items distributed over the network. Objects harvested into metadata (protocol was Harvest/SOIF, nowadays OAI-PMH can be used). Harvests digital objects, not full DLs. But has regular DL characteristics.

33 33 A.Frank SDL vs. HDL

34 34 A.Frank Parallel Evolution of SEs and DLs
Generation | Search Engines Generations | Digital Libraries Generations
1st | Basic SE (BSE): includes robots, indices, directories, basic/advanced user interfaces. | Stand-alone (SDL): local, classical, focused material, digitized or scanned.
2nd | Meta SE (MSE): uses several basic SEs simultaneously (federated search), ranks gathered pages by relevancy. | Federated (FDL): comprised of autonomous SDLs representing related, possibly heterogeneous, network repositories.
3rd | Popularity SE (PSE): uses link analysis and use frequency measures to filter and rank the Web pages. | Harvested (HDL): contains only summaries and metadata structures; domain focused, of fine granularity.

35 35 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

36 36 A.Frank Why are SEs overused? I always use Google/Yahoo! It’s just a quick search! The truth? – not sure what I’m looking for. I’m too used to using SEs. SEs are more general, no? SEs always give me enough answers. SEs don’t care what my topic/domain is!

37 37 A.Frank SE vs. DL - Server Side

38 38 A.Frank SE vs. DL - Client Side

39 39 A.Frank So what was the message ?

40 40 A.Frank Qualitative IR from Digital Library?! Fact: Quantity orientation in SE. Fact: Quality orientation in DL. Assumption: Accessible DLs in sought after domain. Assumption: Usable information retrieval interfaces for DLs. Result: High quality information retrieval from digital libraries!

41 41 A.Frank Why are DLs underused (social)? Too used to classical libraries (fond memories). No public awareness (an unknown entity). No public relations (unlike for Portals/SEs). No money in it (marketing, banners, services). If it’s a library, you have to pay to use it, no? Are DLs up-to-date at all (as much as SEs)? No DLs in my language (localization).

42 42 A.Frank Why are DLs underused (general)? Portals don’t offer DLs (services). Aren’t DLs part of the Invisible/Deep Web? DLs are just for experts! Many interests – will need to know many DLs. How to find them at all (need a jump-start)? How to find relevant ones (sounds like search)? How to find the right one (too many around)? Lack of domain coverage (no DL in my area).

43 43 A.Frank Why are DLs underused (technical)? SEs crawl/index DLs, no? Aren’t directories enough? Aren’t SSEs (Specialized SEs) enough? Too focused/limited (too fine granularity). Need know-how to use DLs (unlike for SEs). Non-usable interfaces (not user-friendly). Mostly textual, not multimedia (like SEs are).

44 44 A.Frank DL Awareness & Discovery Problems Lack of use of and familiarity with DLs. DLs scattered around the Web are hard to locate and identify. Not enough metadata kept for and on the DLs. DLs’ topic, focus, and user interfaces are not always clear and usable.

45 45 A.Frank So how to tilt the balance of SE/DL use?

46 46 A.Frank Sample (Digital) Library Directories
Berkeley LibWeb (Library Servers via Web) – http://sunsite.berkeley.edu/Libweb/
Academic Info: Digital Libraries – http://www.academicinfo.net/digital.html
Google Directory: Digital Libraries – http://directory.google.com/Top/Reference/Libraries/Digital/
Librarians’ Index to the Internet – http://lii.org/

47 47 A.Frank Use General SEs and DL Directories? Why not just use large general SEs? –noisy results, metadata not sufficient, too many (re)tries to get relevant results. Why not just use existing DL Directories? –messy categorization, non-friendly UI, not all libraries are DLs, not really DL Directories.

48 48 A.Frank Some possible directions/solutions Get SEs to better index, reference, and advertise DLs. Provide specialized SEs for locating DLs. Construct and enhance DL directories. Extend DL coverage to more topics/domains. Employ SE-like interfaces in DLs: –user-friendly interface (Google-like) –easy-to-use site (usability like in SE)

49 49 A.Frank If more time... we could SEEk more

50 50 A.Frank Theory vs. Practice?

51 51 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

52 52 A.Frank SELFDL Goals SELFDL: Search Engine Locator For Digital Libraries. Discover/identify/classify/generate DL resources/sites in the (in)visible Web. Supply search tools for users to find relevant DLs for their needs. Provide better, usable (thin) interfaces for locating DLs. Raise awareness, knowledge, discovery and use of DLs.

53 53 A.Frank Naming

54 54 A.Frank SELFDL Model/Architecture [Diagram components: Index, Directory, Meta]

55 55 A.Frank SELFDL – gateway to world of DLs

56 56 A.Frank SELFDL techniques Harness SE technologies to locate DLs on the Web using: –Extractors: Extract DLs from DLs directories. –Crawlers: focused crawl in search of DLs. –Scripts: Interface with Google/Yahoo APIs. Use site analysis (search for DL terms). Support Extended DC (Dublin Core) metadata for each DL. Provide SELFDL database indexing.

57 57 A.Frank DLs Identification test Manual collection of a list of 65 terms that could indicate that a Web site is a DL. Check whether there is a statistically significant connection between each of the terms and the fact that a Web site is a DL. The initial statistical test included 100 manually identified DLs and 100 random Web sites. The statistical measure used (in SPSS) was cross tabulation, tested with Chi-square, phi coefficient and Cramer’s V.
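
The per-term test described on this slide can be sketched for a single term as a 2x2 cross tabulation (term present/absent against DL/non-DL). This is a minimal illustration, not the SPSS procedure the authors ran: the function name and the counts in the usage note are hypothetical, and the statistic shown is the plain Pearson chi-square without continuity correction.

```python
import math


def association_2x2(a, b, c, d):
    """Association measures for a 2x2 cross tabulation.

    a: DLs containing the term      b: DLs without the term
    c: non-DLs containing the term  d: non-DLs without the term
    Returns (chi_square, phi, cramers_v).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("every row and column needs at least one observation")
    # Pearson chi-square for a 2x2 table (no continuity correction).
    chi_square = n * (a * d - b * c) ** 2 / denom
    # Phi keeps the sign of the association.
    phi = (a * d - b * c) / math.sqrt(denom)
    # Cramer's V; for a 2x2 table min(rows - 1, cols - 1) is 1, so V == |phi|.
    cramers_v = math.sqrt(chi_square / n)
    return chi_square, phi, cramers_v


# Hypothetical counts: the term appears on 80 of 100 DLs and 10 of 100 random sites.
chi_square, phi, cramers_v = association_2x2(80, 20, 10, 90)
# With 1 degree of freedom, chi_square above 3.841 is significant at the 0.05 level.
significant = chi_square > 3.841
```

With one such table per candidate term, the terms whose chi-square exceeds the chosen critical value would be the ones kept as DL indicators.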

58 58 A.Frank Results of DLs Identification test Terms that have been found to be statistically significant: 1. documents, book(s), journal(s), electronic/internet/web resource(s) 2. catalog(s)/catalogue(s) 3. ask a librarian, patron(s) 4. digital library, library, digital collection(s) 5. copyright(s) 6. preservation/preserve, digitization/digitize
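
A rough sketch of how such terms could be used for site analysis follows. It only checks which indicator terms occur in a page's text; the term subset, the function names, and the `min_terms` threshold are assumptions made for illustration and are not taken from the SELFDL prototype.

```python
import re

# A subset of the significant terms listed on the slide above.
DL_TERMS = [
    "digital library", "digital collection", "catalog", "catalogue",
    "ask a librarian", "patron", "preservation", "digitization",
    "journal", "copyright",
]


def matched_dl_terms(page_text, terms=DL_TERMS):
    """Return the indicator terms found in the text (whole words, optional plural 's')."""
    text = page_text.lower()
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text):
            found.append(term)
    return found


def looks_like_dl(page_text, min_terms=3):
    """Hypothetical rule: flag a page once at least min_terms indicator terms appear."""
    return len(matched_dl_terms(page_text)) >= min_terms


sample = "Welcome to our Digital Library. Browse the catalog or ask a librarian about our journals."
hits = matched_dl_terms(sample)
```

A real classifier would weigh terms by the strength of their association rather than counting them equally; the flat threshold here is only the simplest possible stand-in.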

59 59 A.Frank SELFDL Directory UI

60 60 A.Frank SELFDL Directory classifications
A Digital Library is classified by Topic, Focus, and Domain:
Topic (DDC): Life Science: DDC 570; Earth Science: DDC 550; Biology: DDC 574
Focus (Breeding): Children; Academic; Professional
Domain (IANA): Countries - .IL; Commercial - .COM; Educational - .EDU

61 61 A.Frank Example DDC topic’s tree

62 62 A.Frank SELFDL Directory results example

63 63 A.Frank Advantages of SELFDL Directory Contains just DLs. Better classification/perspective based on domain/focus/topic. Provides user-friendly interface; like Google Directory. Additional metadata (based on DC).

64 64 A.Frank SELFDL Index UI

65 65 A.Frank SELFDL Index Results from Web focused crawling. Can be searched for specific DL criteria: –keywords –DL type (SDL, FDL, HDL) –DL media/content (audio, E-books, E-serials, theses, movies, etc…) –Protocol support (OAI-PMH)

66 66 A.Frank SELFDL Index example queries topic:biology domain:com algebra domain:com source:crawler focus:children type:SDL protocol:OAI topic:math media:ebooks
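
The `field:value` syntax in these example queries could be split into structured filters and free keywords roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the field names are the ones that appear on the slide, and because the slide's example tokens run together, the query boundaries used in the usage line are a guess.

```python
KNOWN_FIELDS = ("topic", "domain", "source", "focus", "type", "protocol", "media")


def parse_query(query, known_fields=KNOWN_FIELDS):
    """Split a query into {field: value} filters and a list of free keywords."""
    filters = {}
    keywords = []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and value and field.lower() in known_fields:
            filters[field.lower()] = value
        else:
            # Anything that is not a recognized field:value pair is a plain keyword.
            keywords.append(token)
    return filters, keywords


filters, keywords = parse_query("algebra domain:com source:crawler")
```

Tokens with an unknown field name fall through as keywords, so a query such as `http://example.org` would not be misread as a filter.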

67 67 A.Frank SELFDL Index results example

68 68 A.Frank Advantages of SELFDL Index Built according to insights/techniques of various studies in the field. Supports directory and crawler results. Provides specialized SE for DLs. Easy to use query interface. Supports advanced keywords search.

69 69 A.Frank SELFDL Meta

70 70 A.Frank SELFDL Meta Engine Can be searched for DL keywords like in an ordinary search engine. Intersects SE (i.e., Google/Yahoo API) results with SELFDL database to extract the current DLs to be returned as query response. Performs like a regular SE – convenient for public use.
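
The intersection step can be sketched as filtering a list of search engine result URLs down to those whose host is already in the DL database. This is an assumption-laden illustration: no real Google or Yahoo API is called, the result URLs are made up, and matching on host name (ignoring a leading `www.`) is one possible matching rule, not necessarily the one SELFDL uses.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def filter_to_known_dls(se_result_urls, known_dl_hosts):
    """Keep SE results that point at known DL hosts, one per host, in SE rank order."""
    known = {h.lower() for h in known_dl_hosts}
    seen = set()
    kept = []
    for url in se_result_urls:
        host = host_of(url)
        if host in known and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept


# Hypothetical SE results and a tiny stand-in for the SELFDL database.
se_results = [
    "http://www.example-shop.com/books",
    "http://nsdl.org/search?q=algebra",
    "http://www.nzdl.org/",
    "http://nsdl.org/about",
]
relevant_dls = filter_to_known_dls(se_results, {"nsdl.org", "nzdl.org"})
```

Preserving the search engine's order keeps its relevance ranking while the database supplies the "is this a DL" judgment.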

71 71 A.Frank SELFDL intersects with Google & Yahoo! results [Diagram labels: SELFDL, Google, Yahoo!; intersection: Relevant DLs]

72 72 A.Frank SELFDL Meta results example

73 73 A.Frank Google “Sponsored” DL Interface

74 74 A.Frank Advantages of SELFDL Meta Provides all the advantages of the SELFDL model (UI, metadata). Supports query interface for terms, like existing SEs. Supports intersection between SEs results and relevant DLs. Supports different orders of results.

75 75 A.Frank SELFDL prototype testing methods Efficiency measures were computed for Directory and Meta. Satisfaction surveys were given to users before and after SELFDL use. A check was carried out to find the best GUI for SELFDL (regular or Google-like).

76 76 A.Frank Efficiency testing methods A series of queries was evaluated for result relevancy. The F-measure was used as the efficiency measure: F = 2PR/(P+R), the (equally weighted) harmonic average of P and R, where P is the precision of results and R is the relative recall of results. The two components tested were SELFDL Meta and SELFDL Directory.
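
As a worked illustration of the measure, the sketch below computes P, R, and F for one query. It assumes the usual reading of relative recall, where the denominator is the pool of relevant results found by all compared tools rather than all relevant items on the Web; the function names and the sample data are hypothetical.

```python
def precision_and_relative_recall(retrieved, relevant_pool):
    """P = relevant retrieved / retrieved; R = relevant retrieved / pooled relevant."""
    retrieved = list(retrieved)
    relevant_pool = set(relevant_pool)
    hits = sum(1 for item in retrieved if item in relevant_pool)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_pool) if relevant_pool else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic average of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical query: 4 results returned, 2 of them in a pool of 3 relevant DLs.
p, r = precision_and_relative_recall(["a", "b", "c", "d"], {"a", "b", "e"})
f = f_measure(p, r)
```

Here P = 1/2 and R = 2/3, so F = 2(1/2)(2/3) / (1/2 + 2/3) = 4/7.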

77 77 A.Frank SELFDL Directory vs. DL Directories [Chart: R and P]

78 78 A.Frank SELFDL Meta vs. Google & Yahoo [Chart: R and P]

79 79 A.Frank Users’ satisfaction surveys 1. Usability of Web utilities. 2. Ease of locating DLs. 3. Ease of identifying if site is DL. 4. DL results relevance. 5. DL metadata readability.

80 80 A.Frank Google DL Interface

81 81 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

82 82 A.Frank RIDDLE Goals RIDDLE: Resource Inquiry and Discovery in a DL Environment. Enable creation of HDLs by harvesting (filtering) relevant SDLs using OAI-PMH. Enable construction of HDLs based on composition of lower-level HDLs, so as to increase the coverage of DLs’ topics. Enable information exchange with SELFDL. Raise awareness, knowledge, discovery and use of DLs.

83 83 A.Frank Example of topics’ composition [Tree diagram labels: University; Life Sciences, Exact Sciences, Social Sciences; Chemistry, Computer Science; Hardware, Software]

84 84 A.Frank OAI-PMH Protocol OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting. Tackles the lack of uniformity and interoperability between data repositories, which makes information sharing between repositories difficult. Addresses these problems by defining the way queries are sent to repositories and the way answers are received. Mandates at least one metadata format for repositories’ use – Dublin Core (DC).
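
To make the "way queries are sent" concrete: an OAI-PMH request is an HTTP request whose parameters name one of six verbs plus its arguments. The sketch below only builds such request URLs and sends nothing over the network. The helper name, the base URL, and the set name are hypothetical; the verb names and the `oai_dc` metadata prefix come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for a repository's base URL."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    params = {"verb": verb}
    # Skip arguments explicitly passed as None.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


# Hypothetical repository; selective harvest of one Set in Dublin Core.
url = oai_request_url(
    "http://dl.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="physics",
)
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token the repository returns for long result lists.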

85 85 A.Frank RIDDLE Model/Architecture [Layered diagram] Layer 1 – Internet; Layer 2 – Data Providers; Layer 3 – Service Providers; Layer 4 – Aggregated Service Providers; Layer 5 – Presentation. Other diagram labels: Web, SDL, OAI-PMH, HDL, Enhanced OAI-PMH, Aggregated HDLs, Web interfaces.

86 86 A.Frank Use of OAI-PMH for FDLs/HDLs OAI-PMH was planned to support harvesting, as manifested in its name, and also in its design (i.e., selective harvesting using “Sets”). However, the number of FDLs that use the protocol is relatively large, while there are very few HDLs that employ it. Since HDLs, unlike FDLs, filter the information, and not just federate it, we investigate ways by which HDLs can filter information using the OAI-PMH protocol.

87 87 A.Frank Levels of information filtering There are 3 levels where information filtering can be done, though each level has its various problems, mostly caused by lack of uniformity between SDLs: 1. Item-level metadata – relates to problems with the use of DC entries (that are well known). 2. Group-level metadata – the use of OAI-PMH Sets for selective harvesting is not well defined, so it cannot easily be used for relating to groups of items. 3. Library-level metadata – description of the metadata of this level is not well defined. Creation of HDLs using OAI-PMH is not fully supported.

88 88 A.Frank Suggested extensions to OAI-PMH Lack of uniformity in SDLs using OAI-PMH prevents effective creation of HDLs. To provide better harvesting/filtering capabilities from SDLs, we suggest (re-)using standards, as follows: 1. Item-level metadata – use of extended DC for metadata description, instead of just DC. 2. Group-level metadata – use of a DDC topic as a defined Set identifier. 3. Library-level metadata – use of extended DC for the library description field in the OAI-PMH Identify verb.

89 89 A.Frank The RIDDLE Prototype Provides for regular creation of FDLs. Enables creation of HDLs by harvesting/filtering the relevant SDLs. Supports HDL aggregation based on DDC hierarchy. The user search results return not only items matching the query but also HDLs and SDLs related to the indicated topic. The user can search the HDLs hierarchy (by textual or directory search) for a specific HDL and further down the aggregated HDLs tree.

90 90 A.Frank RIDDLE entry page

91 91 A.Frank Sample results page, first entry an HDL

92 92 A.Frank HDL aggregation The HDL aggregation capability is based on: –use of the DDC topics hierarchy. –assigning each HDL a suitable DDC topic identifier. –providing it with an OAI-PMH interface, similar to what data providers have, thus enabling and supporting an HDLs hierarchy. –supporting both offline and online construction and corresponding search.
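
The role of the DDC hierarchy in aggregation can be sketched with three-digit DDC codes, where a section such as 574 sits under its division 570, which sits under its main class 500. This is a simplified illustration: it ignores decimal extensions and the irregularities of real DDC schedules, and the function names and HDL names are hypothetical rather than taken from the RIDDLE prototype.

```python
def ddc_ancestors(code):
    """Ancestors of a three-digit DDC code, nearest first (e.g. '574' -> ['570', '500'])."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("expected a three-digit DDC code, got " + repr(code))
    division = code[:2] + "0"
    main_class = code[0] + "00"
    ancestors = []
    if division != code:
        ancestors.append(division)
    if main_class != division:
        ancestors.append(main_class)
    return ancestors


def members_of(aggregate_topic, hdl_topics):
    """Names of HDLs whose DDC topic lies below the aggregating HDL's topic."""
    return sorted(
        name for name, code in hdl_topics.items()
        if aggregate_topic in ddc_ancestors(code)
    )


# Hypothetical lower-level HDLs, each assigned a DDC topic identifier.
hdl_topics = {
    "Biology HDL": "574",
    "Earth Science HDL": "550",
    "Life Science HDL": "570",
    "Law HDL": "340",
}
science_members = members_of("500", hdl_topics)
```

An aggregated HDL for topic 500 would then harvest from the member HDLs through their OAI-PMH interfaces, in the same way a lower-level HDL harvests from SDLs.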

93 93 A.Frank Directory search with topics

94 94 A.Frank RIDDLE Experimentation Several tests were carried out, as follows: 1. The quality of information retrieval when using a specific HDL vs. use of several FDLs. 2. Ease of discovering and using the aggregated HDLs. 3. User preferences in searching several FDLs vs. use of aggregated HDLs. Initial testing indicates that use of HDLs and aggregated HDLs is more efficient than use of separate FDLs.

95 95 A.Frank Efficiency measures for RIDDLE

96 96 A.Frank Contents SEs vs. DLs?! DL Definition/Types How to tilt the balance of SE/DL use? SELFDL Model/Architecture RIDDLE Model/Architecture Future directions

97 97 A.Frank Future directions Better locating, identification and ranking of DLs and their categories/types. Conduct wider, more significant, tests using SELFDL and RIDDLE. Publish a beta Web version of SELFDL and RIDDLE for public use/feedback. Better integration between SELFDL and RIDDLE. Investigate awareness and discovery of DLs on the Web.

98 98 A.Frank References
Sharon, T. & Frank, A., “Digital Libraries on the Internet”, IFLA'00 66th IFLA Council and General Conference, 13-18, Jerusalem, Israel, August 2000, http://www.ifla.org/IV/ifla66/papers/029-142e.htm
Hanani, U. & Frank, A., “The Parallel Evolution of Search Engines and Digital Libraries: their Convergence to the Mega-Portal”, ICDL'00 Kyoto Intl. Conf. on Digital Libraries: Research and Practice, 269-276, Kyoto, Japan, November 2000, http://csdl.computer.org/comp/proceedings/kyotodl/2000/1022/00/10220211abs.htm
Yom Tov, N. & Frank, A., “Harnessing Search Engine Technologies to Increase Awareness and Discovery of Digital Libraries”, 4th IEEE Intl. Conf. on IT: Research and Education (ITRE), Tel-Aviv, October 2006.
Kadury, A. & Frank, A., “Harvesting and Aggregation of Digital Libraries in the OAI Framework”, WEBIST 2007, 3rd Intl. Conf. on Web Information Systems and Technologies, 441-446, Barcelona, Spain, March 2007.

99 99 A.Frank Bibliography Arms W. Y., Digital Libraries, MIT Press, Cambridge, 2000. Hill, L., Buchel, O., Janée, G. & Lei, Z. M., “Integration of Knowledge Organization Systems into Digital Library Architectures”, Position Paper for 13th ASIS&T SIG/CR Workshop, “Reconceptualizing Classification Research”, 62-68, Philadelphia, PA, 2002. Pace A. K., The Ultimate Digital Library, American Library Association, Chicago, 2003. Lossau N., “Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet”, D-Lib Magazine, Vol. 10, No. 6, June 2004. Summann F. & Lossau N., “Search Engine Technology and Digital Libraries: Moving from Theory to Practice”, D-Lib Magazine Online, Vol. 10, No. 9, September 2004. Lippincott J. K., “Net Generation Students and Libraries”, EDUCAUSE Review, Vol. 40, No. 2, March/April 2005.

100 100 A.Frank Still around :-?)

