Slide 1: Distributed IR for Digital Libraries
Ray R. Larson
School of Information Management & Systems, University of California, Berkeley
ray@sims.berkeley.edu
ECDL 2003, Trondheim -- August 20, 2003
Slide 2: Overview
The problem area
Distributed searching tasks and issues
Our approach to resource characterization and search
Experimental evaluation of the approach
Application and use of this method in working systems
Slide 3: The Problem
Prof. Casarosa’s definition of the Digital Library vision in yesterday afternoon’s plenary session -- access for everyone to “all human knowledge”
Lyman and Varian’s estimates of the “Dark Web”
Hundreds or thousands of servers with databases ranging widely in content, topic, and format
–Broadcast search is expensive in bandwidth and in processing too many irrelevant results
–How to select the “best” ones to search?
  Which resource to search first?
  Which to search next if more is wanted?
–Topical/domain constraints on the search selections
–Variable contents of databases (metadata only, full text, multimedia, ...)
Slide 4: Distributed Search Tasks
Resource Description
–How to collect metadata about digital libraries and their collections or databases
Resource Selection
–How to select relevant digital library collections or databases from a large number of databases
Distributed Search
–How to perform parallel or sequential searching over the selected digital library databases
Data Fusion
–How to merge query results from different digital libraries with their different search engines, differing record structures, etc.
Slide 5: An Approach for Distributed Resource Discovery
Distributed resource representation and discovery
–New approach to building resource descriptions based on Z39.50
–Instead of broadcast searching across resources, we use two Z39.50 services:
  Identification of database metadata using Z39.50 Explain
  Extraction of distributed indexes using Z39.50 SCAN
Evaluation
–How efficiently can we build distributed indexes?
–How effectively can we choose databases using the index?
–How effective is merging search results from multiple sources?
–Can we build hierarchies of servers (general / meta-topical / individual)?
Slide 6: Z39.50 Overview
[Diagram: the client UI and each server map queries and results to and from Z39.50 across the Internet, with the server’s local search engine sitting behind the Z39.50 interface.]
Slide 7: Z39.50 Explain
Explain supports searches for:
–Server-level metadata
  Server name
  IP addresses
  Ports
–Database-level metadata
  Database name
  Search attributes (indexes and combinations)
–Support metadata (record syntaxes, etc.)
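As an illustration only, the server-level and database-level metadata returned by Explain can be collected into a small per-resource record. The Python sketch below shows one hypothetical shape for that record; the field names are assumptions and are not part of the Z39.50 Explain record syntax itself.

from dataclasses import dataclass, field

@dataclass
class DatabaseInfo:
    """Database-level metadata harvested via Explain (hypothetical field names)."""
    name: str
    search_attributes: list[str] = field(default_factory=list)   # usable index/attribute combinations
    record_syntaxes: list[str] = field(default_factory=list)     # supported record syntaxes, e.g. "XML", "SUTRS"

@dataclass
class ServerInfo:
    """Server-level metadata harvested via Explain (hypothetical field names)."""
    server_name: str
    host: str
    port: int
    databases: list[DatabaseInfo] = field(default_factory=list)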
Slide 8: Z39.50 SCAN
Originally intended to support browsing
The SCAN request specifies:
–Database
–Attributes plus term (i.e., index and start point)
–Step size
–Number of terms to retrieve
–Position in response set
The result contains:
–Number of terms returned
–List of terms and their frequency in the database (for the given attribute combination)
Slide 9: Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} ...

% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} ...
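A minimal parsing sketch for the Tcl-style braced output shown above; the function name is hypothetical, and the code assumes that nested braces occur only in the leading {SCAN ...} status group, as in these examples.

import re

def parse_scan_response(text: str) -> list[tuple[str, int]]:
    """Parse zscan-style output like
    '{SCAN {Status 0} ...} {cat 27} {cat-fight 1} ...'
    into (term, frequency) pairs, skipping the leading status group."""
    # Split out top-level {...} groups by tracking brace depth.
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    pairs = []
    for g in groups:
        if g.startswith("SCAN"):
            continue  # status / step-size header, not a term
        m = re.match(r"(\S+)\s+(\d+)$", g.strip())
        if m:
            pairs.append((m.group(1), int(m.group(2))))
    return pairs

# Example (abbreviated from the title-index scan above):
# parse_scan_response("{SCAN {Status 0} {Terms 20}} {cat 27} {catalan 19}")
# returns [("cat", 27), ("catalan", 19)]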
Slide 10: Resource Index Creation
For all servers, or a topical subset:
–Get Explain information
–For each index:
  Use SCAN to extract terms and frequencies
  Add term + frequency + source index + database metadata to the XML “Collection Document” for the resource (a sketch of this loop follows below)
–Planned extensions: post-process indexes (especially geographic names, etc.) for special types of data
  e.g. create “geographical coverage” indexes
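A sketch of the harvesting loop just described. The explain and scan arguments are hypothetical callables standing in for a real Z39.50 client, and the XML layout is a simplified stand-in for the actual Cheshire collection-document schema; paging through a full index term list is omitted.

import xml.etree.ElementTree as ET

def build_collection_document(host, port, explain, scan):
    """Build an XML "collection document" for one resource: database-level
    metadata plus (term, frequency, source index) entries gathered via SCAN.

    `explain(host, port)` is assumed to yield (database_name, [index_names])
    pairs and `scan(host, port, database, index)` to yield (term, frequency)
    pairs; both are hypothetical stand-ins for a real Z39.50 client."""
    root = ET.Element("collectionDocument", host=host, port=str(port))
    for db_name, index_names in explain(host, port):
        db_el = ET.SubElement(root, "database", name=db_name)
        for index in index_names:
            idx_el = ET.SubElement(db_el, "index", name=index)
            for term, freq in scan(host, port, db_name, index):
                ET.SubElement(idx_el, "term", freq=str(freq)).text = term
    return ET.tostring(root, encoding="unicode")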
Slide 11: MetaSearch Approach
[Diagram: a MetaSearch server with its own search engine and distributed index sends Explain and SCAN queries over the Internet to the remote search engines serving databases DB 1-6; user queries and results are then mapped between the MetaSearch server and the selected remote servers.]
Slide 12: Known Issues and Problems
Not all Z39.50 servers support SCAN or Explain
Solutions that appear to work well:
–Probing for attributes instead of Explain (e.g. DC attributes or analogs)
–We also support OAI and can extract OAI metadata from servers that support it
–Query-based sampling (Callan); a sketch of the idea follows below
Collection documents are static and need to be replaced when the associated collection changes
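For servers that support only search, query-based sampling (Callan) approximates a collection representative from retrieved documents. A rough sketch of the idea follows; the search(term, limit) callable is hypothetical and the probe and sample sizes are purely illustrative.

import random
from collections import Counter

def query_based_sample(search, seed_terms, n_queries=75, docs_per_query=4):
    """Approximate a collection representative for a server that supports
    search but not SCAN or Explain.  `search(term, limit)` is a hypothetical
    helper returning document texts; parameter values are illustrative."""
    term_freqs = Counter()
    vocabulary = list(seed_terms)
    for _ in range(n_queries):
        probe = random.choice(vocabulary)
        for doc in search(probe, limit=docs_per_query):
            tokens = doc.lower().split()
            term_freqs.update(tokens)
            vocabulary.extend(tokens)   # later probes are drawn from sampled documents
    return term_freqs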
Slide 13: Evaluation
Test environment:
–TREC Tipster data (approx. 3 GB)
–Partitioned into 236 smaller collections based on source and date by month (no DOE)
  High size variability (from 1 to thousands of records)
  Same database as used in other distributed search studies by J. French and J. Callan, among others
–Used TREC topics 51-150 for evaluation (the only topics with relevance judgements for all 3 TIPSTER disks)
Slide 14: Test Database Characteristics

TREC Disk   Source         Size (MB)   Documents
1           WSJ (86-89)    270         98,732
1           AP (89)        259         84,678
1           ZIFF           245         75,180
1           FR (89)        262         25,960
2           WSJ (90-92)    247         74,520
2           AP (88)        241         79,919
2           ZIFF           178         56,920
2           FR (88)        211         19,860
3           AP (90)        242         78,321
3           SJMN (91)      290         90,257
3           PAT            245         6,711
Totals                     2,690       691,058
Slide 15: Test Database Characteristics

TREC Disk   Source         Num DB         Per-disk total
1           WSJ (86-89)    29
1           AP (89)        12
1           ZIFF           14
1           FR (89)        12             Disk 1: 67
2           WSJ (90-92)    22
2           AP (88)        11
2           ZIFF           11 (1 dup)
2           FR (88)        10             Disk 2: 54
3           AP (90)        12
3           SJMN (91)      12
3           PAT            92             Disk 3: 116
Totals                     237 - 1 duplicate = 236
Slide 16: Harvesting Efficiency
Tested using the databases on the previous slide plus the full FT database (210,158 records, ~600 MB)
Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
Average of 14.07 seconds
Larger databases were also tested: e.g. the TREC FT database (~600 MB, 7 indexes) was harvested in 131 seconds
Slide 17: Our Collection Ranking Approach
We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource, i.e. fusion of multiple query results (a sketch of the fusion step follows below)
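The fusion step can be pictured as below. The plain averaging used here is an illustrative assumption, not necessarily the combination rule used in the actual system; the function name is hypothetical.

def fuse_index_estimates(index_probabilities: dict[str, float]) -> float:
    """Combine per-index probability-of-relevance estimates for one resource
    into a single ranking score (simple average, for illustration only)."""
    if not index_probabilities:
        return 0.0
    return sum(index_probabilities.values()) / len(index_probabilities)

# e.g. fuse_index_estimates({"title": 0.42, "topic": 0.71}) returns 0.565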
Slide 18: Probabilistic Retrieval: Logistic Regression
The probability of relevance for a given index is estimated by logistic regression, with the coefficient values determined from a training sample of documents (TREC). At retrieval time the probability estimate is obtained from the fitted model, as sketched below.
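The equation on this slide did not survive the export. As a sketch, a Berkeley-style logistic regression estimate has the general form below, where the x_i are the statistics listed on the next slide and the c_i are coefficients fitted on the TREC training sample; the exact coefficient values and any transformations of the x_i are given in the paper.

\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i x_i
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}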
Slide 19: Statistics Used for Regression Variables
Average Absolute Query Frequency
Query Length
Average Absolute Collection Frequency
Collection size estimate
Average Inverse Collection Frequency
Number of terms in common between query and collection representative
(Details in the proceedings)
Slide 20: Other Approaches
GlOSS – developed by the Digital Library project at Stanford University; uses fairly conventional TF-IDF ranking
CORI – developed by J. Callan and students at CIIR; uses a ranking that exploits some of the features of the INQUERY system in merging evidence
Slide 21: Evaluation
Effectiveness:
–Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements
–Tested by comparing our approach to known algorithms for ranking collections
–Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
–Recall analog: how many of the relevant documents occurred in the top n databases, averaged over queries
Slide 22: Titles only (short queries)
Slide 23: Long Queries
Slide 24: Very Long Queries
Slide 25: Current Usage
Mersey Libraries
Distributed Archives Hub
Related approaches:
–JISC Resource Discovery Network (OAI-PMH harvesting with Cheshire search)
–Planned use with TEL by the British Library
Slide 26: Future
Logically clustering servers by topic
Meta-meta servers (treating the MetaSearch database as just another database)
Slide 27: Distributed Metadata Servers
[Diagram: a hierarchy of replicated general servers, meta-topical servers, and individual database servers.]
Slide 28: Conclusion
A practical method for metadata harvesting and an effective algorithm for distributed resource discovery
Further research:
–Continuing development of the Cheshire III system
–Applicability of language modelling methods to resource discovery
–Developing and evaluating methods for merging cross-domain results, such as text and image or text and GIS datasets (or, perhaps, deciding when to keep them separate)
Slide 29: Further Information
Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
–Includes HTML documentation
Project web site: http://cheshire.berkeley.edu/
Slide 31: Probabilistic Retrieval: Logistic Regression Attributes
Average Absolute Query Frequency
Query Length
Average Absolute Collection Frequency
Collection size estimate
Average Inverse Collection Frequency
Inverse Document Frequency (N = number of collections; M = number of terms in common between the query and the collection representative)
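The defining equations for these attributes were shown as images and are not in the transcript. As an assumption based on standard Berkeley-style formulations (not a restatement of the paper), the inverse collection frequency component and its average might take a form like:

\mathrm{icf}_j = \log \frac{N}{n_j}
\qquad
\mathrm{AICF} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{icf}_j

where N is the number of collections, n_j is the number of collection representatives containing term j, and the sum runs over the M terms common to the query and the collection representative.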
Slide 32: CORI Ranking
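The formula itself was shown as an image. For reference, the published CORI collection-scoring function (Callan et al.) is commonly given in the form below; this is restated from the literature, so treat the constants as indicative rather than as the exact variant used in this comparison.

T = \frac{df}{df + 50 + 150 \cdot cw / \overline{cw}}
\qquad
I = \frac{\log\left((|DB| + 0.5) / cf\right)}{\log\left(|DB| + 1.0\right)}
\qquad
p(r_k \mid db_i) = b + (1 - b) \cdot T \cdot I, \quad b \approx 0.4

where df is the number of documents in db_i containing term r_k, cf is the number of databases containing r_k, |DB| is the number of databases, cw is the number of words in db_i, and \overline{cw} is the mean cw over all databases.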
Slide 33: Measures for Evaluation
Assume each database has some merit for a given query q
Given a baseline ranking B and an estimated (test) ranking E for q
Let db_bi and db_ei denote the database in the i-th ranked position of rankings B and E
Let B_i = merit(q, db_bi) and E_i = merit(q, db_ei)
We can define some measures (next two slides):
Slide 34: Measures for Evaluation – Recall Analogs
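The formula was shown as an image; the usual recall analog in this line of work (following French and Powell) is, assuming that is the form intended here:

\hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}

i.e., the merit accumulated by the top n databases of the estimated ranking relative to the merit accumulated by the top n of the baseline ranking.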
Slide 35: Measures for Evaluation – Precision Analog
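Likewise, a commonly used precision analog (again an assumption that the standard form is intended) is the fraction of the top n databases in the estimated ranking that have non-zero merit:

P_n = \frac{\left|\{\, db_{e_i} : i \le n,\ \mathrm{merit}(q, db_{e_i}) > 0 \,\}\right|}{n}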