11/15/2001Database Management -- Spring 2001 -- R. Larson Object-Relational Database Applications -- The UC Berkeley Environmental Digital Library University.

Presentation transcript:

11/15/2001Database Management -- Spring R. Larson Object-Relational Database Applications -- The UC Berkeley Environmental Digital Library University of California, Berkeley School of Information Management and Systems SIMS 257: Database Management

Today
Object-Relational Database Applications
– The Berkeley Digital Library Project (slides from RRL and Robert Wilensky, EECS)
– Use of DBMS in DL project

Final Presentations and Reports
Specifications for the final report are on the Web site under Assignments
Presentations (Nov 27th & 30th, Dec 4th and 6th)
– Signup sheet being passed around

Today
Object-Relational Applications
The UCB Digital Library

Overview
What is a Digital Library?
Overview of Ongoing Research on Information Access in Digital Libraries

Digital Libraries Are Like Traditional Libraries...
Involve large repositories of information (storage, preservation, and access)
Provide information organization and retrieval facilities (categorization, indexing)
Provide access for communities of users (communities may be as large as the general public or as small as the employees of a particular organization)

Traditional Library System (diagram labels: Originators, Libraries, Users)

But Digital Libraries Are Different From Libraries...
Not a physical location with local copies; objects held closer to originators
Decoupling of storage, organization, access
Enhanced authoring (origination, annotation, support for work groups)
Subscription, pay-per-view supported in addition to “free” browsing
Integration into user tasks

A Digital Library Infrastructure Model (diagram labels: Originators, Repositories, Index Services, Network, Users)

UC Berkeley Digital Library Project
Foci: Work-centered digital information services and Re-Inventing Scholarly Information
Testbed: Digital Library for the California Environment
Research: Technical agenda supporting user-oriented access to large distributed collections of diverse data types
Part of the NSF/NASA/DARPA Digital Library Initiative (Phases 1 and 2, and the International DL initiative)

UCB Digital Library Project: Research Organizations
UC Berkeley EECS, SIMS, CED, IS&T
UCOP
Xerox PARC’s Document Image Decoding group and Work Practices group
Hewlett-Packard
NEC
Sun Microsystems
IBM Almaden
Microsoft
Ricoh California Research
Philips Research

Testbed: An Environmental Digital Library
Collection: Diverse material relevant to California’s key habitats.
Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.
Potential: Impact on state-wide environmental system (CERES)

The Environmental Library - Users/Contributors
California Resources Agency, California Environmental Resources Evaluation System (CERES)
California Department of Water Resources
The California Department of Fish & Game
SANDAG
UC Water Resources Center Archives
New Partners: CDL and SDSC

The Environmental Library - Contents
Environmental technical reports, bulletins, etc.
County general plans
Aerial and ground photography
USGS topographic maps
Land use and other special purpose maps
Sensor data
“Derived” information
Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
Supporting 3-D, economic, traffic, etc. models
Videos collected by the California Resources Agency

The Environmental Library - Contents
As of late 2000, the collection represents about one terabyte of data, including over 165,000 digital images, about 300,000 pages of environmental documents, and nearly 2 million records in geographical and botanical databases.

Botanical Data
The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 600,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.

Geographical Data
Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with the 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.

Documents
Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.

Documents - cont.
The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.

Image Data
The photo collection includes over 17,000 images of California natural resources from the state Department of Water Resources, several hundred aerial photos, over 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others, a small collection of California animals, and 40,000 Corel stock photos. These images are used within the project for computer vision research.

Testbed Success Stories
LUPIN: CERES’ Land Use Planning Information Network
– California County General Plans and other environmental documents.
– Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server.
California flood relief efforts
– High demand for some data sets only available on our server (created by document recognition).
CalFlora: Creation and interoperation of repositories pertaining to plant biology.
Cloning of services at Cal State Library, FBI

Research Highlights
Documents
– Multivalent Document prototype: page images, structured documents, GIS data, photographs
Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing, scene retrieval
– Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces

Multivalent Documents
MVD Model
– radically distributed, open, extensible
– “behaviors” and “layers”: behaviors conform to a protocol suite; inter-operation via “IDEG”
Applied to “enlivening legacy documents”
– various nice behaviors, e.g., lenses

Document Presentation
Problem: Digital libraries must deliver digital documents -- but in what form?
Different forms have advantages for particular purposes
– Retrieval
– Reuse
– Content Analysis
– Storage and archiving
Combining forms (Multivalent documents)

Spectrum of Digital Document Representations (figure)
Adapted from Fox, E.A., et al. “Users, User Interfaces, and Objects: Envision, a Digital Library”, JASIS 44(8), 1993

Document Representation: Multivalent Documents
Primary user interface/document model for UCB Digital Library (Wilensky & Phelps)
Goal: An approach to new document representations and their authoring.
Supports active, distributed, composable transformations of multimedia documents.
Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.

Multivalent Documents (diagram labels: Scanned Page Image, Cheshire Layer, OCR Layer, OCR Mapping Layer, GIS Layer, Table Layer, Network Protocols & Resources)
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). Webster’s 7th Collegiate Dictionary


MVD Third Party Work
Japanese support by NEC; application to office document management
Printing, support for other OCR formats, by HP
Chinese character and multilingual lens by UCB Instructional Support staff (Owen McGrath)
Automatic enlivening of documents via Transcend proxy

MVD Forthcoming
Support for XML + style sheets
More robust parsing
Saving where you want
Media adaptors for
– Continuous media
– Near image formats, word proc. formats
Improve authoring tools
Interoperation with paper
Application versus applet?
Release to community, get feedback, iterate.

GIS in the MVD Framework
Layers are georeferenced data sets.
Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
Written in Java (to be merged with MVD-1 code line?)

GIS Viewer: Recent Developments
Annotation and saving
– points, rectangles (w. labels and links), vectors
– saving of annotations as separate layer
Integration with address, street finding, gazetteer services
Application to image viewing: tilePix
Castanet client


GIS Viewer Example (screenshot)

Geographic Information: Plans and Ideas
More annotations, flexible saving
Support for large vector data sets
Interoperability
– On-the-fly conversion of formats, generation of “catalogs”
– Via OGDI/GLTP
– Experimenting with various CERES servers

Documents: Information from scanned documents
Built document recognizers for some important documents, e.g. “Bulletin 17”, “TR-9”.
Recognized document structure, with an order of magnitude better OCR.
Automatically generated a 1395-item relational database of dams.
Enabled access via forms, map interfaces.
Enabled interoperation with image DB.

Document Recognition: Future Plans
Document recognizers for about a dozen document types
Development and integration of mathematical OCR and recognition.
Eventually produce a document recognizer generator, i.e., make it easier to write recognizers.

Vision-Based Image Retrieval
Stuff-based queries: “blobs”
– Basic blobs: colors, sizes, variable number; demonstrated utility for interesting queries
– “Blob world”: above plus texture, applied to retrieving similar images; successful learning scene classifier
Thing-finding: successfully deployed detectors; adding body plans (adding shape, geometry and kinematic constraints)
Find objects by grouping coherent low-level properties

Image Retrieval Research
Finding “Stuff” vs “Things”
BlobWorld
Other Vision Research

(Old “stuff”-based image retrieval: Query)

(Old “stuff”-based image retrieval: Result)

Blobworld: use regions for retrieval
We want to find general objects
Represent images based on coherent regions

(“Thing”-based image retrieval using “body plans”: Result)

Natural Language Processing: Automatic Topic Assignment
Developed automatic categorization/disambiguation method to the point where topic assignment (but not disambiguation) appears feasible.
Ran controlled experiment:
– Took Yahoo as ground truth.
– Chose 9 overlapping categories; took 1000 web pages from Yahoo as input.
– Result: 84% precision; 48% recall (using top 5 of 1073 categories)
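As a reminder of what the two figures on this slide measure, here is a minimal sketch of precision and recall for a single page. The category names and the function name are hypothetical and are not taken from the experiment; the slide does not say how the scores were averaged across pages.

```python
def precision_recall(assigned, relevant):
    """Precision and recall of assigned categories against ground truth.

    assigned: categories the classifier proposed (e.g. its top 5)
    relevant: categories the ground truth (here, a directory) lists
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 categories proposed, 3 are correct in the ground truth.
p, r = precision_recall(
    assigned={"botany", "hydrology", "zoning", "traffic"},
    relevant={"botany", "hydrology", "geology"},
)
print(f"precision={p:.2f} recall={r:.2f}")
```

High precision with lower recall, as reported on the slide, means the categories that are assigned are usually right, but many correct categories are not proposed.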

Distributed Resource Discovery and Structured Data Searching With Cheshire II
Ray R. Larson
School of Information Management & Systems, University of California, Berkeley

Research Areas
Goals are
– Practical application of existing Digital Library technologies to some large-scale cross-domain collections; evaluation of distributed search in a cross-domain environment
– Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs

Approach
For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
Databases include:
– HE Archives hub
– Arts and Humanities Data Service (AHDS)
– MASTER
– CURL (Consortium of University Research Libraries)
– Online Archive of California (OAC)
– Making of America II (MOA2)

Current Usage of Cheshire II
Web clients for:
– Berkeley NSF/NASA/ARPA Digital Library
– World Conservation Digital Library
– SunSite (UC Berkeley Science Libraries)
– University of Liverpool
– Higher Education Archives Hub: Glasgow, Edinburgh, Bath, Liverpool, Kings College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
– University of Essex, HDS (part of AHDS)
– Oxford Text Archive (test only)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Berkeley Metadata project cross-language demo
– Univ. of Virginia (test implementations)
– Cheshire ranking algorithm is basis for original Inktomi

Current and Upcoming Usage of Cheshire II
DIEPER: Digitized European Periodicals project
NESSTAR (Networked Social Science Tools and Resources)
FASTER: Flexible Access to Statistics Tables and Electronic Resources (continuation of NESSTAR)
MASTER (Manuscript Access through Standards for Electronic Records)

Upcoming Usage of Cheshire II
ZETOC (prototype of the Electronic Table of Contents from the British Library)
Archives Hub
RSLP Palaeography project
British Natural History Museum, London
JISC data services directory hosted by MIMAS
Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search

Client/Server Architecture
Server Supports:
– Database storage
– Indexing
– Z39.50 access to local data
– Boolean and Probabilistic Searching
– Relevance Feedback
– External SQL database support
Client Supports:
– Programmable (Tcl/Tk) Graphical User Interface
– Z39.50 access to remote servers
– SGML/XML & MARC formatting
Combined Client/Server CGI scripting via WebCheshire used for web applications
Mozilla client (under development in Liverpool)

SGML/XML Support
Example XML record for a DL document
ELIB-v June 12, 1996 June 1996 Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada University of California report USDA Forest Service Neil H. Berg Ken B. Roby Bruce J. McGurk SNEP Vol 3 40 /elib/data/docs/0700/756/HYPEROCR/hyperocr.html /elib/data/docs/0700/756/OCR-ASCII-NOZONE

SGML/XML Support
Example SGML/MARC Record
n a m CUBGGLAD1282B nyu eng u (CU)ocm (CU)GLAD1282 Burch, John G. Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski 3rd ed New York : J. Wiley, 1983 xvi, 632 p. : ill. ; 24 cm Includes bibliographical references and index Management information systems....

Component Extraction and Retrieval
Any sub-element of an SGML/XML document can be defined as a separately indexed “component”.
Components can be ranked and retrieved independently of the source document (but linked back to their original source)
For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
Example: Glasier archives…

Component Extraction and Retrieval
The Glasier archive is an EAD document (1.9 MB in size)
Contains “Series, Subseries, and Item level” descriptions of things in the archive

Excerpt from Glasier Archive
GP-1-1: General correspondence. Public letters. GP-1-1 Glasier Papers. General correspondence. Public letters. Arrangement Public letters arranged alphabetically within each year GP Letter from Richard Murray. Glasgow ; <unitdate > 7 Apr Murray, Richard 1 letter Employment reference for J.B.G. as draughtsman Glasier, John Bruce ETC….

Example Component Def
… /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 NONE c level item …

Components
Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
Components permit parts of complex SGML/XML documents to be treated as separate documents
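To make the component idea concrete, the following is a minimal sketch, not Cheshire II code, that pulls item-level components out of an EAD-like document so each could be indexed as its own unit. The sample markup is a simplified, hypothetical reconstruction (the tags in the excerpt above were lost), and the function name is an assumption; the `c` element with `level="item"` follows the component definition slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EAD-like fragment; element names other than
# <c level="..."> are illustrative only.
SAMPLE = """
<archdesc>
  <c level="series">
    <head>General correspondence. Public letters.</head>
    <c level="item">
      <unittitle>Letter from Richard Murray.</unittitle>
      <scopecontent>Employment reference for J.B.G. as draughtsman</scopecontent>
    </c>
    <c level="item">
      <unittitle>Second letter (placeholder text)</unittitle>
    </c>
  </c>
</archdesc>
"""


def extract_components(xml_text, tag="c", level="item"):
    """Return the text of every <tag level=...> element as a separate component."""
    root = ET.fromstring(xml_text)
    components = []
    for element in root.iter(tag):
        if element.get("level") == level:
            # Collapse whitespace so each component is one searchable string.
            text = " ".join(" ".join(element.itertext()).split())
            components.append(text)
    return components


for number, component in enumerate(extract_components(SAMPLE), start=1):
    print(number, component)
```

Each returned string stands in for one "component document"; a real system would also keep a pointer back to the source record, as the slides describe.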

Cheshire II Searching (diagram labels: Z39.50, Internet, Images, Scanned Text, Local, Remote)

Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document -- logged

Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC).
At retrieval the probability estimate is obtained from the fitted logistic function of the 6 X attribute measures shown on the previous slide.
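The slide's own equation did not survive in this transcript, so the following is only a sketch of the generic logistic regression form: a linear combination of the attribute values gives the log-odds of relevance, which is then mapped to a probability. The coefficient values, the intercept, and the function name are hypothetical placeholders, not the values fitted on TREC data.

```python
import math


def relevance_probability(x, coefficients, intercept):
    """Estimate P(relevant | query, document) with a logistic model.

    x            : attribute values X1..Xn for one query/document pair
    coefficients : fitted weights c1..cn (hypothetical here)
    intercept    : fitted constant c0 (hypothetical here)
    """
    log_odds = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder weights for six attributes; real values come from regression
# on a training sample with relevance judgements.
COEFFS = [0.5, -0.1, 0.3, -0.05, 0.8, 1.2]
INTERCEPT = -3.0

pair_features = [1.0, 2.0, 1.5, 6.0, 2.2, 1.1]
print(round(relevance_probability(pair_features, COEFFS, INTERCEPT), 3))
```

Documents are then ranked by this estimate, highest first.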

Cheshire Probabilistic Retrieval
Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have Probabilistic searching performed:
– zfind “cheshire cats, looking glasses, march hares and other such things”
– zfind caucus races
Boolean and Probabilistic elements can be combined:
– zfind government documents and title guidebooks

Combining Search Types
It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
E.g.:
– Search of Full Text (Probabilistic)
– Search of Full Text (Boolean)
– Search of Components (Probabilistic)
– Search of Titles (Probabilistic)
– Search of Subject Headings (Probabilistic)
All result sets are merged and re-ranked to produce the final list.
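The slides do not say how the merged list is re-ranked, so the sketch below shows just one simple possibility: normalize each result set by its top score and sum the normalized scores per document. The function name, the scoring rule, and the sample scores are all assumptions for illustration, not the Cheshire II method.

```python
def merge_result_sets(result_sets):
    """Merge several {doc_id: score} result sets into one ranked list.

    Each set is scaled so its best score is 1.0, then scores are summed,
    so documents found by several searches rise in the final ranking.
    Boolean results can be passed with a constant score such as 1.0.
    """
    merged = {}
    for results in result_sets:
        if not results:
            continue
        top = max(results.values())
        for doc_id, score in results.items():
            normalized = score / top if top > 0 else 0.0
            merged[doc_id] = merged.get(doc_id, 0.0) + normalized
    # Highest combined score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


full_text = {"d1": 0.8, "d2": 0.4}   # probabilistic full-text search
titles = {"d2": 0.9, "d3": 0.3}      # probabilistic title search
for doc_id, score in merge_result_sets([full_text, titles]):
    print(doc_id, round(score, 3))
```

In this example `d2` comes first because both searches retrieved it.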

Distributed Search: The Problem
Hundreds or thousands of servers with databases ranging widely in content, topic, format
– Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
– How to select the “best” ones to search? What to search first; which to search next
– Topical/domain constraints on the search selections
– Variable contents of database (metadata only, full text…)

An Approach for Cross-Domain Resource Discovery
MetaSearch
– New approach to building metasearch based on Z39.50
– Instead of using broadcast search we are using two Z39.50 services: identification of database metadata using Z39.50 Explain, and extraction of distributed indexes using Z39.50 SCAN
Evaluation
– How efficiently can we build distributed indexes? Very…
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple sources?
– Hierarchies of servers (general/meta-topical/individual)?
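To illustrate why harvested indexes avoid broadcast search, here is a deliberately naive sketch of database selection: each database is scored by the summed frequency of the query terms in its harvested index, and only databases with a nonzero score are candidates. This scoring rule, the function name, and the database names are assumptions for illustration; the slides do not describe the ranking actually used.

```python
def rank_databases(query_terms, collection_indexes):
    """Rank databases by summed frequency of the query terms.

    collection_indexes maps a database name to {term: frequency}, as
    could be harvested with SCAN. Databases with no matching term are
    dropped, so they are never searched.
    """
    scores = {
        database: sum(term_freqs.get(term, 0) for term in query_terms)
        for database, term_freqs in collection_indexes.items()
    }
    candidates = [(db, score) for db, score in scores.items() if score > 0]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical harvested indexes for three databases.
indexes = {
    "db_a": {"cat": 27, "catalan": 19},
    "db_b": {"cat": 706, "cat-litter": 1},
    "db_c": {"dam": 1395},
}
print(rank_databases(["cat"], indexes))
```

Only the top-ranked databases would then receive the actual search, which saves bandwidth compared with querying every server.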

Z39.50 Overview (diagram labels: UI, Map Query, Map Results, Internet, Search Engine)

Z39.50 Explain
Explain supports searches for
– Server-level metadata: server name, IP addresses, ports
– Database-level metadata: database name, search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc.)

Z39.50 SCAN
Originally intended to support browsing
Query for
– Database
– Attributes plus Term (i.e., index and start point)
– Step Size
– Number of terms to retrieve
– Position in Response set
Results
– Number of terms returned
– List of terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
% zscan title cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
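The SCAN output shown above is a brace-delimited list of term/frequency pairs preceded by a status header. As a rough sketch, assuming the response always has exactly the shape shown on this slide, it could be parsed outside the Tcl client as follows. The function name and regular expressions are illustrative and not part of Cheshire II.

```python
import re

# Header looks like: {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
HEADER_RE = re.compile(r"\{SCAN((?:\s*\{\w+ \d+\})+)\}")
# Entries look like: {cat-fight 1}
PAIR_RE = re.compile(r"\{([^\s{}]+) (\d+)\}")


def parse_scan_response(text):
    """Split a zscan response into (header dict, list of (term, frequency))."""
    header_match = HEADER_RE.search(text)
    if header_match is None:
        raise ValueError("no SCAN header found")
    header = {key: int(value) for key, value in PAIR_RE.findall(header_match.group(1))}
    body = text[header_match.end():]
    terms = [(term, int(freq)) for term, freq in PAIR_RE.findall(body)]
    return header, terms


response = (
    "{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
    "{cat 27} {cat-fight 1} {catalan 19} {catalogu 37}"
)
header, terms = parse_scan_response(response)
print(header["Terms"], terms[:2])
```

Terms containing spaces or braces would need a real Tcl list parser; this sketch only covers single-token terms like those in the example.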

MetaSearch Server Index Creation
For all servers, or a topical subset…
– Get Explain information (especially DC mappings)
– For each index (or each DC index): use SCAN to extract terms and frequency; add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
– Planned extensions: post-process indexes (especially Geo Names, etc.) for special types of data, e.g. create “geographical coverage” indexes
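The harvesting loop above can be sketched as building one XML "Collection Document" per database from already-scanned terms. The element and attribute names (`collection`, `index`, `term`, `freq`) and the function name are hypothetical; the slides do not give the actual schema, and the Explain and SCAN network calls are replaced here by an in-memory dictionary.

```python
import xml.etree.ElementTree as ET


def build_collection_document(database_name, scanned_indexes):
    """Build an XML collection document from harvested index terms.

    scanned_indexes maps an index name (e.g. "title") to a list of
    (term, frequency) pairs, as a SCAN of that index might return.
    """
    root = ET.Element("collection", {"database": database_name})
    for index_name, terms in scanned_indexes.items():
        index_element = ET.SubElement(root, "index", {"name": index_name})
        for term, frequency in terms:
            # Record term + frequency under its source index.
            term_element = ET.SubElement(index_element, "term", {"freq": str(frequency)})
            term_element.text = term
    return root


document = build_collection_document(
    "example_db",  # hypothetical database name
    {"title": [("cat", 27), ("catalan", 19)], "topic": [("cat", 706)]},
)
print(ET.tostring(document, encoding="unicode"))
```

The metasearch server would index these documents like any other XML records, so choosing databases becomes an ordinary search over collection documents.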

MetaSearch Approach (diagram labels: MetaSearch Server, Map Explain and Scan Queries, Map Results, Internet, Map Query, Search Engine, DB 1 through DB 6, Distributed Index)

Known Problems
Not all Z39.50 servers support SCAN or Explain
Solutions:
– Probing for attributes instead of Explain (e.g. DC attributes or analogs)
– We also support OAI and can extract OAI metadata for servers that support OAI
Collection Documents are static and need to be replaced when the associated collection changes

Evaluation
Test Environment
– TREC Tipster and FT data (approx. 3.5 GB)
– Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
– High size variability (range from 1 to thousands of docs)
– 21,225,299 words, 142,345,670 chars total for harvested records
Efficiency (old data)
– Average of seconds per database to SCAN each database (3.4 indexes on average)
– Average of seconds excluding FT (131 seconds for FT database with 7 indexes)
– Now collecting more information, so harvest times are longer, but still under one minute on average

Evaluation
Effectiveness
– Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
– Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed

Future
Testing of variant algorithms for ranking collections
Application to real systems and testing in a production environment (Archives Hub)
Logically clustering servers by topic
Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers (diagram labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers)

Further Information
Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
Project Web Site