1
August 22, 2001 NASA Ames Lecture -- Ray R. Larson
XML Structured Document Retrieval and Distributed Resource Discovery
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
ray@sherlock.berkeley.edu
2
Context
NSF/JISC International Digital Library Grant
–Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
UC Berkeley DLI2 Grant:
–ReInventing Scholarly Information Access
UC Berkeley working with the University of Liverpool/Manchester Computing with participation from
–De Montfort University (MASTER)
–Arts and Humanities Data Service (http://ahds.ac.uk/)
  OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
–Consortium of University Research Libraries (CURL)
–UC Berkeley Library (and California Digital Library)
  Making of America II
  Online Archive of California
–British Natural History Museum, London
–NESSTAR (NEtworked Social Science Tools and Resources)
3
Research Areas
Goals are
–Practical application of existing DL technologies to some large-scale cross-domain collections
–Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs
4
Approach
For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
Databases include:
–HE Archives hub
–Arts and Humanities Data Service (AHDS)
–MASTER
–CURL (Consortium of University Research Libraries)
–Online Archive of California (OAC)
–Making of America II (MOA2)
5
Current Usage of Cheshire II
Web clients for:
–Berkeley NSF/NASA/ARPA Digital Library
–World Conservation Digital Library
–SunSite (UC Berkeley Science Libraries)
–University of Liverpool
–Higher Education Archives Hub
  Glasgow, Edinburgh, Bath, Liverpool, King's College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
–University of Essex, HDS (part of AHDS)
–Oxford Text Archive (test only)
–California Sheet Music Project
–Cha-Cha (Berkeley Intranet Search Engine)
–Berkeley Metadata project cross-language demo
–Univ. of Virginia (test implementations)
–Cheshire ranking algorithm is basis for original Inktomi
6
Current and Upcoming Usage of Cheshire II
DIEPER Digitized European Periodicals project.
–http://gdz.sub.uni-goettingen.de/dieper/
NESSTAR (Networked Social Science Tools and Resources).
–http://www.nesstar.org/
FASTER – Flexible Access to Statistics Tables and Electronic Resources. (Continuation of NESSTAR)
–http://www.faster-data.org/
MASTER (Manuscript Access through Standards for Electronic Records).
–http://www.cta.dmu.ac.uk/projects/master/
7
Upcoming Usage of Cheshire II
ZETOC (Prototype of the Electronic Table of Contents from the British Library)
–http://zetoc.mimas.ac.uk/
Archives Hub
–http://www.archiveshub.ac.uk/
RSLP Palaeography project
–http://www.palaeography.ac.uk/
British Natural History Museum, London
JISC data services directory hosted by MIMAS
Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search
8
Client/Server Architecture
Server Supports:
–Database storage
–Indexing
–Z39.50 access to local data
–Boolean and Probabilistic Searching
–Relevance Feedback
–External SQL database support
Client Supports:
–Programmable (Tcl/Tk) Graphical User Interface
–Z39.50 access to remote servers
–SGML/XML & MARC formatting
Combined Client/Server CGI scripting via WebCheshire used for web applications
9
SGML/XML Support
Underlying native format for all data is SGML/XML
The DTD defines the file format for each file
Full SGML/XML parsing
XML Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML tags
Support for SGML/XML component definition and indexing
10
SGML/XML Support
Example XML record for a DL document (field values only)
ELIB-v1.0
756
June 12, 1996
June 1996
Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada
University of California
report
USDA Forest Service
Neil H. Berg
Ken B. Roby
Bruce J. McGurk
SNEP Vol 3
40
/elib/data/docs/0700/756/HYPEROCR/hyperocr.html
/elib/data/docs/0700/756/OCR-ASCII-NOZONE
11
SGML/XML Support
Example SGML/MARC Record
00722 n a m 2 2 00229 4 5 0 00100140000000500170001400800410003101000140007203500200008603500170010 610000190012324501050014225000110024726000320025830000330029050400500032365000360 0373700002200409700002200431950003200453998000700485 CUBGGLAD1282B 19940414143202.0 830810 1983 nyu eng u 82019962 (CU)ocm08866667 (CU)GLAD1282 Burch, John G. Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski 3rd ed New York : J. Wiley, 1983 xvi, 632 p. : ill. ; 24 cm Includes bibliographical references and index Management information systems....
12
SGML/XML Support
Configuration files for the Server are also SGML/XML:
–They include tags describing all of the data files and indexes for the database.
–They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
–They include definition of components and component indexes
13
Component Extraction and Retrieval
Any sub-element of an SGML/XML document can be defined as a separately indexed “component”.
Components can be ranked and retrieved independently of the source document (but linked back to their original source)
For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
Example: Glasier archives…
14
Component Extraction and Retrieval
The Glasier archive is an EAD document (1.9 Mb in size)
Contains “Series, Subseries, and Item level” descriptions of things in the archive
15
Excerpt from Glasier Archive
GP-1-1: General correspondence. Public letters. GP-1-1 Glasier Papers. General correspondence. Public letters. Arrangement Public letters arranged alphabetically within each year GP-1-1-0001 Letter from Richard Murray. Glasgow ; <unitdate > 7 Apr 1879. Murray, Richard 1 letter Employment reference for J.B.G. as draughtsman Glasier, John Bruce ETC….
16
Example Component Def
… /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 NONE c level item …
17
Components
Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
Components permit parts of complex SGML/XML documents to be treated as separate documents
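
The component idea can be sketched in a few lines of Python. This is only an illustration and not how Cheshire II implements it: the sample record, the `id` attribute on the root, and the `extract_components` helper are assumptions, loosely modeled on an EAD file in which `c` elements carry a `level` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD-like record; element and attribute names are illustrative.
SAMPLE = """
<ead id="glasier">
  <c level="series"><unittitle>General correspondence</unittitle>
    <c level="item"><unittitle>Letter from Richard Murray</unittitle></c>
    <c level="item"><unittitle>Employment reference</unittitle></c>
  </c>
</ead>
"""

def extract_components(xml_text, tag, attr, value):
    """Return each matching sub-element as its own small 'document',
    keeping a link back to the source record."""
    root = ET.fromstring(xml_text)
    source_id = root.get("id")
    components = []
    # Number every element with this tag so a component can be located again.
    for number, elem in enumerate(root.iter(tag), start=1):
        if elem.get(attr) != value:
            continue
        text = " ".join("".join(elem.itertext()).split())
        components.append({"source": source_id, "position": number, "text": text})
    return components

components = extract_components(SAMPLE, "c", "level", "item")
```

Each entry in `components` could then be indexed and ranked on its own, while `source` and `position` point back to the parent document.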
18
Cheshire II Searching
[Figure labels: Z39.50, Internet, Images, Scanned Text, Local, Remote]
19
Boolean Search Capability
All Boolean operations are supported
–“zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
–“zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
–“zfind xtitle Alice#”
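
As a rough illustration of what these operations amount to, Boolean search can be modeled as set algebra over an inverted index. The toy index, the record ids, and the `lookup` helper below are made up for the example and say nothing about how Cheshire II stores its indexes.

```python
# Toy inverted index: index name -> term -> set of record ids (made-up data).
INDEX = {
    "author": {"carroll": {1, 2, 3}},
    "title": {"alice": {1}, "alices": {4}, "snark": {3}},
    "subject": {"fantasy": {1, 2, 4}, "poetry": {3}},
}

def lookup(index, term):
    """Return the ids matching a term; a trailing '#' means truncation."""
    postings = INDEX.get(index, {})
    if term.endswith("#"):
        stem = term[:-1]
        hits = set()
        for candidate, ids in postings.items():
            if candidate.startswith(stem):
                hits |= ids
        return hits
    return set(postings.get(term, set()))

# author carroll and (title snark or subject fantasy) not subject poetry
set1 = (lookup("author", "carroll")
        & (lookup("title", "snark") | lookup("subject", "fantasy"))) - lookup("subject", "poetry")

# Stored sets can be combined again, in the spirit of "zfind SET1 and ...".
named_sets = {"SET1": set1}
set2 = named_sets["SET1"] & lookup("title", "alice#")
```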
20
Probabilistic Models
Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
These models rely on accurate estimates of probabilities
21
Probability Ranking Principle
If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
Stephen E. Robertson, J. Documentation 1977
22
Probabilistic Models: Logistic Regression
Estimates for relevance are based on a log-linear model with various statistical measures of document content as independent variables.
The log odds of relevance is a linear function of the attributes.
Term contributions are summed.
The probability of relevance is obtained by inverting the log odds (the logistic transform).
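
For orientation, the standard logistic form can be written as follows. The symbols are assumptions chosen for this sketch and are not necessarily the lecture's notation: $c_0, \dots, c_n$ are fitted coefficients and $X_1, \dots, X_n$ are the attribute measures for a query $Q$ and document $D$.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i
\qquad
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```

A log odds of 0 corresponds to a probability of 0.5, and larger log odds push the probability toward 1.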
23
Logistic Regression
[Figure: plot of Relevance (0 to 100) against Term Frequency in Document (0 to 60)]
24
Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC).
At retrieval time the probability estimate is obtained from the fitted model for the six X attribute measures shown on the next slide.
25
Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document -- logged
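
A minimal sketch of how a fitted logistic model turns attribute values into a probability at retrieval time. The coefficient and attribute values below are placeholders invented for the example; they are not the coefficients estimated from TREC data, and the function name is hypothetical.

```python
import math

def probability_of_relevance(coefficients, attributes):
    """Logistic model: log odds are linear in the attributes,
    and the probability is the logistic transform of the log odds."""
    intercept, *weights = coefficients
    if len(weights) != len(attributes):
        raise ValueError("need one weight per attribute")
    log_odds = intercept + sum(w * x for w, x in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder values only: NOT the coefficients fitted on TREC data.
coefficients = [-3.0, 1.0, -0.1, 0.3, -0.05, 0.1, 1.5]   # intercept + six weights
attributes = [0.7, 2.0, 1.2, 20.0, 3.0, 1.1]             # six attribute measures
p = probability_of_relevance(coefficients, attributes)
```

Documents would then be ranked by decreasing `p`, in line with the Probability Ranking Principle.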
26
Cheshire Probabilistic Retrieval
Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have Probabilistic searching performed:
–zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
–zfind title @ caucus races
Boolean and Probabilistic elements can be combined:
–zfind topic @ government documents and title guidebooks
27
Combining Search Types
It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
–E.g.:
–Search of Full Text (Probabilistic)
–Search of Full Text (Boolean)
–Search of Components (Probabilistic)
–Search of Titles (Probabilistic)
–Search of Subject Headings (Probabilistic)
All result sets are merged and re-ranked to produce the final list.
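
The slide does not say how the merged list is scored, so the following is just one plausible scheme, given as an assumption: scale each result set by its top score, sum the scaled scores per record, and sort. The function name and the sample scores are hypothetical.

```python
def merge_result_sets(result_sets):
    """Merge {record_id: score} dicts and re-rank by summed normalized score."""
    combined = {}
    for scores in result_sets:
        if not scores:
            continue
        top = max(scores.values())
        for record_id, score in scores.items():
            normalized = score / top if top > 0 else 0.0
            combined[record_id] = combined.get(record_id, 0.0) + normalized
    # Highest combined score first; ties broken by record id for stable output.
    return sorted(combined, key=lambda rid: (-combined[rid], rid))

full_text_prob = {"doc1": 0.8, "doc2": 0.4}
full_text_bool = {"doc2": 1.0, "doc3": 1.0}   # Boolean hits are unranked, so all score 1.0
titles_prob = {"doc1": 0.2, "doc3": 0.1}
ranking = merge_result_sets([full_text_prob, full_text_bool, titles_prob])
```

A record that appears in several result sets accumulates evidence from each, which is the intuition behind combining search types.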
28
Relevance Feedback
Any records in a result set can be used for Relevance Feedback
Uses the “set name” to receive feedback instructions.
–zfind SET1:2,5-9,30,45
–zfind SET2:6
Chosen records are used to build a new probabilistic query
Ranked results are returned
Planned support for (modified) Rocchio RF
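
The feedback selection syntax (`SET1:2,5-9,30,45`) is a set name followed by record numbers and ranges. A small parser, written here as an assumption about how such a string could be read rather than as Cheshire's own code, looks like this:

```python
def parse_feedback(spec):
    """Split 'SET1:2,5-9,30,45' into the set name and selected record numbers."""
    set_name, _, selection = spec.partition(":")
    records = []
    for part in selection.split(","):
        if "-" in part:
            # A range such as 5-9 is inclusive at both ends.
            first, last = part.split("-")
            records.extend(range(int(first), int(last) + 1))
        else:
            records.append(int(part))
    return set_name, records

name, chosen = parse_feedback("SET1:2,5-9,30,45")
```

The chosen records would then supply the terms for the new probabilistic query.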
29
Cheshire II - Two-Stage Retrieval (EVM generation)
Example: Using the LC Classification System
–Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
–Permits searching by any term in the class
–Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
–User selects classes to feed back for the “second stage” search of documents (which includes info from first stage selections)
Can be used with any classified/Indexed collection and controlled vocabulary
30
Automatic Class Assignment
[Figure labels: Doc, Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic Class Assignment: polythetic; exclusive or overlapping; usually ordered; clusters are order-independent; usually based on an intellectually derived scheme
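
Step 4 offers two assignment policies. A small sketch of both, assuming the ranked list from step 3 is a list of `(class_label, score)` pairs; the function name, parameters, and sample scores are hypothetical.

```python
def assign_classes(ranked, threshold=None, max_classes=None):
    """ranked: list of (class_label, score) sorted by decreasing score."""
    if not ranked:
        return []
    if threshold is None:
        # Exclusive assignment: keep only the top-ranked class.
        return [ranked[0][0]]
    # Overlapping assignment: keep every class scoring at or above the threshold.
    chosen = [label for label, score in ranked if score >= threshold]
    return chosen[:max_classes] if max_classes else chosen

ranked = [("QA76", 0.9), ("Z699", 0.6), ("T58", 0.2)]
top_only = assign_classes(ranked)
over_threshold = assign_classes(ranked, threshold=0.5)
```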
31
Cheshire II - Cluster Generation
Define basis for clustering records.
–Select field to form the basis of the cluster.
–Evidence Fields to use as contents of the pseudo-documents.
During indexing cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
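
The cluster-generation steps can be sketched as a group-and-merge over records. The field names (`lc_class`, `title`, `subject`), the sample records, and the helper below are assumptions for illustration only; the actual system works from its configuration files.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by the basis field and pool their evidence text."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if basis is None:
            continue
        for field in evidence_fields:
            value = record.get(field)
            if value:
                clusters[basis].append(value)
    # One pseudo-document per unique basis value, in sorted key order.
    return {basis: " ".join(clusters[basis]) for basis in sorted(clusters)}

records = [
    {"lc_class": "Z699", "title": "Information retrieval systems", "subject": "Information storage"},
    {"lc_class": "QA76", "title": "Programming in Tcl", "subject": "Computer programming"},
    {"lc_class": "Z699", "title": "Probabilistic indexing", "subject": "Automatic indexing"},
]
pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subject"])
```

Each value in `pseudo_docs` is the text that would be indexed for that class, so a first-stage query can match any term contributed by any record in the class.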
32
Cheshire II - Two-Stage Retrieval
Using the MeSH Subject Heading System
–Pseudo-Document created for each MeSH heading containing terms derived from “content-rich” portions of documents in that class (other subject headings, titles, abstract, etc.)
–Permits searching by any term in the class
–Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
–User selects classes to feed back for the “second stage” search of documents.
Can be used with any classified/Indexed collection.
33
Distributed Search: The Problem
Hundreds or Thousands of servers with databases ranging widely in content, topic, format
–Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
–How to select the “best” ones to search?
  What to search first
  Which to search next
–Topical/domain constraints on the search selections
–Variable contents of database (metadata only, full text…)
34
An Approach for Cross-Domain Resource Discovery
MetaSearch
–New approach to building metasearch based on Z39.50
–Instead of using broadcast search we are using two Z39.50 Services
  Identification of database metadata using Z39.50 Explain
  Extraction of distributed indexes using Z39.50 SCAN
Evaluation
–How efficiently can we build distributed indexes?
–How effectively can we choose databases using the index?
–How effective is merging search results from multiple sources?
–Hierarchies of servers (general/meta-topical/individual)?
35
Z39.50 Overview
[Figure labels: UI, Map Query, Map Results, Internet, Search Engine]
36
Z39.50 Explain
Explain supports searches for
–Server-Level metadata
  Server Name
  IP Addresses
  Ports
–Database-Level metadata
  Database name
  Search attributes (indexes and combinations)
–Support metadata (record syntaxes, etc)
37
Z39.50 SCAN
Originally intended to support Browsing
Query for
–Database
–Attributes plus Term (i.e., index and start point)
–Step Size
–Number of terms to retrieve
–Position in Response set
Results
–Number of terms returned
–List of Terms and their frequency in the database (for the given attribute combination)
38
Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
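
Output in this brace-delimited shape is easy to turn into term and frequency pairs. The parser below is a sketch that assumes the response looks exactly like the text shown on the slide (a `{SCAN ...}` header followed by `{term frequency}` pairs); it is not part of the Cheshire client.

```python
import re

# One innermost "{key number}" pair, e.g. {cat 27} or {Status 0}.
PAIR = re.compile(r"\{([^{}\s]+) (\d+)\}")

def parse_scan(response):
    """Split a zscan response into a header dict and (term, frequency) pairs."""
    head, sep, tail = response.partition("}}")
    if not sep or not head.startswith("{SCAN"):
        raise ValueError("not a SCAN response")
    # Restore the brace consumed by partition so the last header pair matches.
    header = {key: int(value) for key, value in PAIR.findall(head + "}")}
    terms = [(term, int(freq)) for term, freq in PAIR.findall(tail)]
    return header, terms

sample = ("{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
          "{cat 27} {cat-fight 1} {catalan 19}")
header, terms = parse_scan(sample)
```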
39
MetaSearch Server Index Creation
For all servers, or a topical subset…
–Get Explain information (especially DC mappings)
–For each index (or each DC index)
  Use SCAN to extract terms and frequency
  Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
–Planned extensions:
  Post-Process indexes (especially Geo Names, etc) for special types of data
  –e.g. create “geographical coverage” indexes
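
The harvesting loop can be sketched as follows. The XML element names (`collection`, `index`, `term`), the `fake_scan` stand-in, and the database name are all assumptions made for the example; the lecture does not show the actual Collection Document schema, and a real harvester would issue Z39.50 Explain and SCAN requests instead.

```python
import xml.etree.ElementTree as ET

def build_collection_document(database, scan_index, index_names):
    """Assemble a 'Collection Document' from per-index term frequencies.
    scan_index is any callable returning (term, frequency) pairs for an index."""
    root = ET.Element("collection", {"database": database})
    for index_name in index_names:
        index_elem = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in scan_index(database, index_name):
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return root

def fake_scan(database, index_name):
    # Stand-in for repeated Z39.50 SCAN requests; the data is made up.
    canned = {"title": [("cat", 27), ("catalan", 19)], "topic": [("cat", 706)]}
    return canned.get(index_name, [])

doc = build_collection_document("testdb", fake_scan, ["title", "topic"])
xml_text = ET.tostring(doc, encoding="unicode")
```

Passing the scan function in as a parameter keeps the document-building step independent of the network layer, which makes it easy to test with canned data.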
40
MetaSearch Approach
[Figure labels: MetaSearch Server, Map Explain and Scan Queries, Map Query, Map Results, Internet, Search Engine, DB 1 to DB 6, Distributed Index]
41
Known Problems
Not all Z39.50 Servers support SCAN or Explain
Solutions:
–Probing for attributes instead of explain (e.g. DC attributes or analogs)
–We also support OAI and can extract OAI metadata for servers that support OAI
Collection Documents are static and need to be replaced when the associated collection changes
42
Evaluation
Test Environment
–TREC Tipster and FT data (approx. 3.5 GB)
–Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
  High size variability (Range from 1 to thousands of docs)
  21,225,299 Words, 142,345,670 chars total for harvested records
Efficiency (old data)
–Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
–Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
–Now collecting more information, so harvest times are longer, but still under one minute on average
43
Evaluation
Effectiveness
–Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
–Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed
44
Future
Testing of variant algorithms for ranking collections
Application to real systems and testing in a production environment (Archives Hub)
Logically Clustering servers by topic
Meta-Meta Servers (treating the MetaSearch database as just another database)
45
Distributed Metadata Servers
[Figure labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]
46
Conclusion
A lot of interesting work to be done
–Redesign and development of the Cheshire II system
–Evaluating new meta-indexing methods
–Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate)
47
Further Information
Full Cheshire II client and server source is available ftp://cheshire.berkeley.edu/pub/cheshire/
–Includes HTML documentation
–Also on Berkeley Digital Library Software Distribution CD
Project Web Site http://cheshire.berkeley.edu/