Information Integration for Digital Libraries
August 10, 2000
Prof. Sang Ho Lee
Soongsil University, Seoul, Korea
shlee@computing.soongsil.ac.kr
Information Integration
  - Provision of integrated access to multiple, distributed, heterogeneous databases and other information sources
  - Mediator approach
      - More up-to-date data
      - No need to copy data
      - Query needs can be unknown
  - Data warehouse approach
      - High query performance
      - Can operate when sources are unavailable
      - Extra information at the warehouse: modify, summarize (store aggregates), add historical information
Mediator Approach
  [Diagram; component labels: Client, Mediator, Wrapper, Source]
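The mediator idea can be illustrated with a minimal sketch. It assumes two hypothetical in-memory sources with different schemas; the class names `Wrapper` and `Mediator`, the `Record` type, and the sample data are illustrative only and do not come from the slides. Each wrapper translates a source's native rows into a common record, and the mediator fans a query out at request time without copying any data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    """Common result format shared by all wrappers (hypothetical)."""
    title: str
    source: str


class Wrapper:
    """Adapts one source: runs its native search and maps rows to Records."""

    def __init__(self, name: str, native_search: Callable[[str], List[dict]],
                 title_field: str):
        self.name = name
        self.native_search = native_search
        self.title_field = title_field  # the source's own name for the title column

    def query(self, keyword: str) -> List[Record]:
        rows = self.native_search(keyword)
        return [Record(title=row[self.title_field], source=self.name) for row in rows]


class Mediator:
    """Sends each query to every wrapper at request time; stores no data itself."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = list(wrappers)

    def query(self, keyword: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(keyword))
        return results


# Two hypothetical sources whose schemas differ ("title" vs. "name").
LIBRARY_A = [{"title": "Digital Libraries"}, {"title": "Database Systems"}]
LIBRARY_B = [{"name": "Digital Archives"}, {"name": "Operating Systems"}]


def search_a(keyword: str) -> List[dict]:
    return [r for r in LIBRARY_A if keyword.lower() in r["title"].lower()]


def search_b(keyword: str) -> List[dict]:
    return [r for r in LIBRARY_B if keyword.lower() in r["name"].lower()]


mediator = Mediator([Wrapper("A", search_a, "title"), Wrapper("B", search_b, "name")])
hits = mediator.query("digital")
```

Because the sources are queried live, the results reflect their current contents, which is the "more up-to-date data" property of this approach.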
Data Warehouse Approach
  [Diagram; component labels: Clients, Query & Analysis, Warehouse, Metadata, Integration, Sources]
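For contrast, the warehouse idea can be sketched as follows. This is only an illustration under stated assumptions: the two sources, the table names `docs` and `docs_per_year`, and the function `build_warehouse` are hypothetical. Data is copied from the sources into a local store once, an aggregate is precomputed, and later queries run against the warehouse alone.

```python
import sqlite3

# Hypothetical source contents: (title, year) pairs.
SOURCE_A = [("Digital Libraries", 1999), ("Database Systems", 2000)]
SOURCE_B = [("Web Searching", 2000)]


def build_warehouse(sources):
    """Copy all source rows into an in-memory SQLite warehouse and store an aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (title TEXT, year INTEGER, source TEXT)")
    conn.execute("CREATE TABLE docs_per_year (year INTEGER PRIMARY KEY, n INTEGER)")
    for name, rows in sources.items():
        conn.executemany(
            "INSERT INTO docs VALUES (?, ?, ?)",
            [(title, year, name) for title, year in rows],
        )
    # Summarized data kept at the warehouse (a stored aggregate).
    conn.execute(
        "INSERT INTO docs_per_year SELECT year, COUNT(*) FROM docs GROUP BY year"
    )
    conn.commit()
    return conn


warehouse = build_warehouse({"A": SOURCE_A, "B": SOURCE_B})

# Queries touch only the warehouse, so they work even if the sources are offline.
counts = dict(warehouse.execute("SELECT year, n FROM docs_per_year").fetchall())
```

The trade-off is visible in the sketch: reads are fast and independent of the sources, but the copied rows become stale until the warehouse is rebuilt or refreshed.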
Web Searching Practice
  - Approx. 800 million indexable Web pages (Feb. 1999)
  - Low coverage of the Web
      - No engine indexes more than 16% of indexable Web pages
  - Out of date
      - New pages take months to be indexed
  - Low metadata use
      - 34% use “keywords” or “description” metatags
      - 0.3% use the Dublin Core metadata standard
  - Simple queries
      - Most queries use 1-3 search words
  - Poor relevancy ranking and precision
Metasearch Engines
  - USA
      - SavvySearch (www.savvysearch.com)
      - MetaCrawler (www.go2net.com/search.html)
      - Ask Jeeves (www.askjeeves.com)
      - ProFusion (www.profusion.com)
      - Mamma (www.mamma.com)
      - Ixquick (www.ixquick.com)
  - Korea
      - Wakano (www.wakano.co.kr)
      - Ms. DaChanni (www.mochanni.com)
  - Over 3000 metasearch engines around the world
Operation Flow and Technical Issues
  1. User query
  2. Decompose and format queries
  3. Send queries and get results
  4. Post-processing (ranking, clustering, etc.)
  5. Output result
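The flow above can be sketched end to end as follows. Everything here is an assumption made for illustration: the two stub engines, their query syntaxes ("plus" and "and"), and the merging rule. The slides do not specify a ranking method; this sketch sums reciprocal ranks across engines simply because it is easy to follow.

```python
from collections import defaultdict


def format_query(words, style):
    """Format the decomposed query words in an engine's native syntax."""
    if style == "plus":
        return "+".join(words)
    if style == "and":
        return " AND ".join(words)
    raise ValueError(f"unknown query style: {style}")


# Hypothetical engines: each stub returns a ranked list of URLs.
def engine_one(native_query):
    return ["u1", "u2", "u3"]


def engine_two(native_query):
    return ["u2", "u4"]


ENGINES = {"one": ("plus", engine_one), "two": ("and", engine_two)}


def metasearch(user_query):
    words = user_query.split()                      # decompose the user query
    scores = defaultdict(float)
    for name, (style, search) in ENGINES.items():
        native = format_query(words, style)         # format per engine
        ranked = search(native)                     # send query, get results
        for rank, url in enumerate(ranked, start=1):
            scores[url] += 1.0 / rank               # post-process: sum reciprocal ranks
    # Output: best score first, ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = metasearch("digital library")
```

In this example `u2` comes first because both engines returned it, which shows how merging can reward agreement between sources.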
Current Practice of Metasearch Engines
  - Tend toward a least-common-denominator interface
  - Do not fully use the functions of individual sources
  - Cover general areas rather than a specific area
  - Make little use of domain knowledge
  - Give little consideration to personal profiles
Proposed Research Topics (1)
  - Theme: mediator-based integration techniques (in particular, metasearch engines)
  - Intelligent wrapper techniques
      - Extract, combine, and reconcile information from external sources
      - Exploit user profiles and use the functions of each source as much as possible
      - Should be flexible and adaptable as external sources change
  - Several approaches
      - Formal language based, machine learning based, heuristic based, extended CFG based, …
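As one illustration of the heuristic-based approach, the sketch below extracts result links from a source's HTML result page. The heuristic (treat every absolute `http` link as a hit and skip relative navigation links), the class name `LinkExtractor`, and the sample page are assumptions for this example only; a real wrapper would need source-specific rules and would have to be revised whenever the page layout changes.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Heuristic wrapper: collect absolute links and their anchor text."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._href = None   # URL of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Heuristic: relative links are assumed to be site navigation.
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.hits.append({"url": self._href, "title": title})
            self._href = None


def extract_hits(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.hits


# Hypothetical result page returned by an external source.
SAMPLE_PAGE = (
    '<ul>'
    '<li><a href="http://example.org/a">Paper <b>A</b></a></li>'
    '<li><a href="/help">Help</a></li>'
    '<li><a href="http://example.org/b">Paper B</a></li>'
    '</ul>'
)

results = extract_hits(SAMPLE_PAGE)
```

The fragility of such hand-written rules is one motivation for the learning-based and grammar-based alternatives listed on the slide.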
Proposed Research Topics (2)
  - Efficiency issues
      - How to cache results and queries to provide a fast response to users
      - How to parallelize access to external sources
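Both efficiency questions can be made concrete with a small sketch. It assumes three hypothetical sources simulated by `slow_source`, a thread pool for parallel access, and a plain dictionary as the cache keyed on a normalized query string. Cache expiry, size limits, and error handling are deliberately left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CALLS = []  # records every call made to a source, for inspection


def slow_source(name, query):
    """Stand-in for a remote source with network latency (hypothetical)."""
    CALLS.append((name, query))
    time.sleep(0.05)
    return [f"{name}:{query}"]


SOURCES = ["s1", "s2", "s3"]
_cache = {}


def search_all(query):
    # Normalize so trivially different spellings share one cache entry.
    key = query.strip().lower()
    if key in _cache:
        return _cache[key]          # cache hit: no source is contacted

    # Contact all sources concurrently; map() preserves the source order.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        per_source = list(pool.map(lambda s: slow_source(s, key), SOURCES))

    merged = [hit for hits in per_source for hit in hits]
    _cache[key] = merged
    return merged


first = search_all("Digital Library")
second = search_all("digital library ")  # served from the cache
```

With parallel access, the response time is bounded by the slowest source rather than the sum of all sources, and a repeated query costs no source access at all.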
Research/Development Strategies
  - Categorize objects and develop a specialized search mechanism for each category
  - Build a working system to test theories
  - Experiment with new ranking methods
      - Google, Goto, …