Panagiotis G. Ipeirotis Tom Barry Luis Gravano

Presentation transcript:

Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University

Metasearching? Why?
“Surface” Web vs. “Hidden” Web
“Surface” Web:
  Link structure
  Crawlable
“Hidden” Web:
  Documents “hidden” in databases
  No link structure
  Search engines do not index them
  Need to query each collection individually
9/21/2018 Columbia University Computer Science Dept.

Metasearching Challenges
“Content summaries” of databases (frequencies of words), one per database:
  wireless: 2,000, network: 8,000, ...
  wireless: 0, network: 10
  wireless: 5, network: 40
Select good databases for a given query
Evaluate the query at these databases
Merge the results from these databases
Uniform interfaces
Diagram labels: Metasearcher; Hidden Web; Existing Web Database; Non-indexed Documents; Relational Database / Library / etc.
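The selection step above can be sketched in a few lines. This is a deliberately naive illustration, not the selection algorithm used by SDARTS: the database names (`db_a`, `db_b`, `db_c`), the function name, and the "sum of word frequencies" score are all assumptions; only the word counts come from the slide.

```python
# Content summaries copied from the slide; the database names are hypothetical.
SUMMARIES = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 0, "network": 10},
    "db_c": {"wireless": 5, "network": 40},
}


def rank_databases(summaries, query, top_k=2):
    """Rank databases by the summed frequency of the query words (naive score)."""
    words = query.lower().split()
    scored = []
    for name, freqs in summaries.items():
        score = sum(freqs.get(word, 0) for word in words)
        scored.append((score, name))
    # Highest score first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Databases with no evidence for any query word are never selected.
    return [name for score, name in scored[:top_k] if score > 0]


selected = rank_databases(SUMMARIES, "wireless network")
```

Only the selected databases would then receive the query, and their results would be merged.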

Outline
Background: SDARTS, SDLIP, STARTS
Extracting content summaries from remote web databases
Interfacing with Open Archives Initiative

SDARTS: SDLIP + STARTS
NOT yet another protocol
SDLIP interfaces
STARTS metadata
Diagram labels: Metasearcher; three wrappers, each exposing S and M; underlying access methods: grep, cat, select, http://….
Legend: S = Search, M = Metadata

STARTS: A Metasearching Protocol
Defines:
  Query language
  Results format
  Metadata for the collection
Complements SDLIP for metasearching purposes:
  Provides metadata for individual documents
  Provides content summaries for databases
PubMed content summary:
  number of documents = 3,868,552
  …
  cancer → 1,398,178
  heart → 281,506
  hepatitis → 23,481
  basketball → 907

SDARTS: The Toolkit
SDARTS architecture makes new-wrapper implementation easy
SDARTS toolkit includes reference implementations for common types of text databases:
  Local text databases
  Local XML databases
  Remote web databases
Customization requires just editing configuration files, no programming

SDARTS Content Summaries
Detailed content summaries easily extracted from locally available (plain-text or XML) databases
Detailed content summaries so far not available for remote web databases:
  No access to full contents

Extracting Content Summaries from Remote Web Databases
No direct access to remote documents
Resort to document sampling:
  Send queries to the database
  Retrieve a representative document sample
  Use the sample to create an approximation of the content summary
Database selection algorithms work well even with approximate content summaries
(VLDB 2002)

Topic-based Sampling: Training
Start with a predefined hierarchy and associated, pre-classified documents
Train rule-based document classifiers for each node
The output is a set of rules like:
  Root:
    ibm AND computers → Computers
    lung AND cancer → Health
    …
  Health:
    hepatitis AND liver → Hepatitis
    angina → Heart
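To make the rule format concrete, here is a minimal sketch of how such conjunctive rules could be stored and applied to one document. The data layout and the `classify` function are assumptions made for illustration; the slides do not show how the trained classifiers are represented, and only the four rules themselves come from the slide.

```python
import re

# Each node of the hierarchy maps to (required words, target category) rules.
RULES = {
    "Root": [({"ibm", "computers"}, "Computers"), ({"lung", "cancer"}, "Health")],
    "Health": [({"hepatitis", "liver"}, "Hepatitis"), ({"angina"}, "Heart")],
}


def classify(text, rules=RULES, node="Root"):
    """Follow the first matching rule at each node; return the path taken."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    path = [node]
    while node in rules:
        # A rule fires when all of its words occur in the document.
        matched = next((cat for terms, cat in rules[node] if terms <= words), None)
        if matched is None:
            break
        path.append(matched)
        node = matched
    return path


path = classify("Lung cancer patients with hepatitis and liver damage")
```

The same rules double as query probes in the next step, where each conjunction of words is sent to the database as a query.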

Topic-based Sampling: Probing
Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
Transform each rule into a query
For each query:
  Send query to database
  Record number of matches
  Retrieve top-k documents for query
At the end of the round:
  Analyze matches for each category
  Choose category to focus on
The result is a representative document sample
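The probing loop can be sketched against a toy in-memory database. Everything here is hypothetical scaffolding: `ToyDatabase`, its `search` method, the sample documents, and the "follow the category with the most matches" criterion are simplifying assumptions, not the toolkit's actual wrapper API or category-selection test.

```python
import re

RULES = {
    "Root": [({"ibm", "computers"}, "Computers"), ({"lung", "cancer"}, "Health")],
    "Health": [({"hepatitis", "liver"}, "Hepatitis"), ({"angina"}, "Heart")],
}


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


class ToyDatabase:
    """Stand-in for a remote database reachable only through a search interface."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, terms, k):
        hits = [doc for doc in self.docs if terms <= tokens(doc)]
        return len(hits), hits[:k]  # (number of matches, top-k documents)


def sample_by_probing(db, rules, node="Root", k=4):
    sample = []
    while node in rules:  # one round per visited node
        matches = {}
        for terms, category in rules[node]:
            count, top_docs = db.search(terms, k)
            matches[category] = matches.get(category, 0) + count
            sample.extend(doc for doc in top_docs if doc not in sample)
        best = max(matches, key=matches.get)
        if matches[best] == 0:
            break  # no category produced matches; stop descending
        node = best  # focus the next round on the best category
    return sample, node


db = ToyDatabase([
    "lung cancer screening study",
    "hepatitis and liver disease",
    "angina and heart risk",
    "liver transplant after hepatitis",
    "smoking raises lung cancer risk",
])
sample, focus = sample_by_probing(db, RULES)
```

In this toy run the probes lead from Root to Health and then to Hepatitis, and the documents retrieved along the way form the sample.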

Sample Contains “Relative” Word Frequencies
“Liver” appears in 200 out of 300 documents in sample
“Kidney” appears in 100 out of 300 documents in sample
“Hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
  Query “liver” returned 140,000 matches
  Query “hepatitis” returned 20,000 matches
  “kidney” was not a query probe…
Can exploit number of matches from one-word queries

Adjusting Document Frequencies
We know absolute document frequency f of words from one-word queries
We know ranking r of words according to document frequency in sample
Mandelbrot’s formula connects word frequency f and ranking r
We use curve-fitting to estimate the absolute frequency of all words in sample
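As a worked illustration of the curve-fitting idea, Mandelbrot's formula is commonly written f = P * (r + p)^(-B). The sketch below fits P and B by least squares in log-log space. It is a simplification under stated assumptions: the shift p is fixed (0 by default, which reduces the formula to a plain power law), the function names are hypothetical, and the input uses the liver/kidney/hepatitis numbers from the previous slide, with ranks 1, 2, 3 taken from their sample frequencies.

```python
import math


def fit_mandelbrot(known, shift=0.0):
    """Fit log f = log P - B * log(r + shift) to (rank, frequency) pairs."""
    xs = [math.log(rank + shift) for rank, _ in known]
    ys = [math.log(freq) for _, freq in known]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # needs at least two distinct ranks
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (P, B)


def estimate_frequency(rank, P, B, shift=0.0):
    return P * (rank + shift) ** (-B)


# Sample ranks: liver = 1, kidney = 2, hepatitis = 3.
# Absolute frequencies are known only for the one-word probes.
known = [(1, 140_000), (3, 20_000)]
P, B = fit_mandelbrot(known)
kidney_estimate = estimate_frequency(2, P, B)
```

With only two known points the fit passes through both exactly, and the estimate for "kidney" falls between them; a real run would fit many one-word probes at once.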

Implementing Content-Summary Extraction in SDARTS Toolkit
Implemented content-summary extraction module as J2EE-compliant servlet
First, build SDARTS wrapper for remote web database
Then, trigger extraction process to generate content summary automatically
Module customizable with any classification scheme
Toolkit provides 72-node hierarchical scheme and associated classifiers
To add a new scheme, define the hierarchy and provide classifiers for the internal nodes

Fraction of PubMed Content Summary
number of documents = 3,868,552
…
cancer → 1,398,178
aids → 106,512
heart → 281,506
angina → 26,775
hepatitis → 23,481
basketball → 907
cpu → 487
Extracted automatically
~27,500 words in the extracted content summary
Fewer than 200 queries sent
Retrieved 4 documents per query
The extracted content summary accurately represents size and contents of the database

Topic-based Sampling: Conclusions
SDARTS now supports extraction of detailed content summaries from any database, local or remote
Sophisticated database selection algorithms can now be implemented on top of SDARTS
Implemented and available for download:
  Database Selection Module
  SDARTS Client with Database Selection

Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself; every man is a piece of the continent, a part of the main...” (John Donne)
Export SDARTS metadata under OAI
Access transparently any OAI collection through SDARTS
Diagram labels: OAI Service Provider; SDARTS/SDLIP Server; OAI Data Provider; SDARTS Client

Exporting SDARTS Metadata under OAI
SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
  Easy mapping to Dublin Core
SDARTS also exports content summaries under OAI
  Each SDARTS collection is mapped to an OAI set
  We export the content summaries under OAI, as metadata about the set
Example record:
  <PAPER>
    <TITLE>The threat of vancomycin resistance</TITLE>
    <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
    <FILENO>ajm_106_05_0489</FILENO>
    <APPEARED>
      <JRNL>American Journal of Medicine</JRNL>
      <VOL>106</VOL><ISS>5</ISS>
      <DATE>3 May </DATE>
      <YEAR>1999</YEAR>
    </APPEARED>
    <ABSTRACT> … </ABSTRACT>
    <BODY> … </BODY>
  </PAPER>
COLUMBIA SDARTS Server collections:
  PubMed Publications
  Aides Medical Collection
  NOAH: New York Online Access to Health
  Cardiovascular Institute of the South
  Columbia's DLI2 Medical Corpus
  Harrisons Online
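The "easy mapping to Dublin Core" can be illustrated with the record shown on the slide. The particular field mapping below (for example AUTHORS to creator, JRNL to source) is an assumption chosen for illustration; the slides do not list the mapping SDARTS actually uses, and `to_dublin_core` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RECORD = """<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE>
    <YEAR>1999</YEAR>
  </APPEARED>
  <ABSTRACT> … </ABSTRACT>
  <BODY> … </BODY>
</PAPER>"""

# Assumed mapping from record paths to Dublin Core element names.
FIELD_MAP = {
    "TITLE": "title",
    "AUTHORS": "creator",
    "FILENO": "identifier",
    "APPEARED/JRNL": "source",
    "APPEARED/YEAR": "date",
}


def to_dublin_core(xml_text, field_map=FIELD_MAP):
    """Return a dict of Dublin Core elements extracted from one record."""
    root = ET.fromstring(xml_text)
    dc = {}
    for path, dc_name in field_map.items():
        node = root.find(path)
        if node is not None and node.text and node.text.strip():
            dc[dc_name] = node.text.strip()
    return dc


dc_record = to_dublin_core(RECORD)
```

Keeping the mapping in a table rather than in code mirrors the toolkit's stated goal of customization through configuration only.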

SDARTS OAI Server: Details
Uses OCLC OAI Server
Uses MySQL (via JDBC) to store OAI records
Records materialized after first request for space efficiency
Distributed as WAR file
Simple configuration: specify SDARTS/MySQL address
Diagram labels: OAI Service Provider; SDARTS OAI Interface; JDBC; SDARTS Server; MySQL RDBMS
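"Materialized after first request" describes a lazy cache: a record is built and stored only when someone first asks for it. The sketch below shows that pattern only. It uses Python's built-in `sqlite3` as a stand-in for the MySQL/JDBC store, and the class, table, and callback names are hypothetical rather than taken from the SDARTS code.

```python
import sqlite3


class LazyRecordStore:
    """Store OAI records on first request instead of precomputing all of them."""

    def __init__(self, build_record):
        self.db = sqlite3.connect(":memory:")  # stand-in for the MySQL database
        self.db.execute("CREATE TABLE records (id TEXT PRIMARY KEY, xml TEXT)")
        self.build_record = build_record  # callback that asks the SDARTS server
        self.builds = 0  # counts how often a record had to be generated

    def get(self, record_id):
        row = self.db.execute(
            "SELECT xml FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if row is not None:
            return row[0]  # already materialized
        xml = self.build_record(record_id)
        self.builds += 1
        self.db.execute("INSERT INTO records VALUES (?, ?)", (record_id, xml))
        return xml


store = LazyRecordStore(lambda record_id: f"<record>{record_id}</record>")
first = store.get("ajm_106_05_0489")
second = store.get("ajm_106_05_0489")
```

The second call is served from the table, so records that are never requested never take up space.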

Searching OAI Collections
OAI is not designed for searching
  Possible to restrict only “Date” and “Set”
Need to search OAI collections
  Users want to specify “Title”, “Author”, etc.
Diagram labels: User; Author = “F. Douglass”; OAI Service Provider; ?; OAI Data Provider (e.g., Library of Congress)

Harvesting and Searching OAI within SDARTS
OAI exports metadata records in XML
SDARTS can index and search XML collections
Solution:
  Harvest OAI records (by “Date”, “Set”)
  Store records locally as XML documents
  Use SDARTS XML wrapper to index them
The OAI collection is searchable as an SDARTS XML database
Diagram labels: OAI Data Provider (e.g., Library of Congress); Harvest OAI/XML records; SDARTS/SDLIP Server; Index OAI/XML records

Adding an OAI Collection in SDARTS
Example values: http://memory.loc.gov/cgi-bin/oai, loc, 2002-01-01
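A harvest restricted by date (and optionally by set) comes down to one HTTP request per batch of records. The sketch below only builds the request URL and sends nothing. The parameter names (`verb`, `metadataPrefix`, `from`, `set`) follow OAI-PMH 2.0, which may differ from the protocol version the toolkit targeted; the helper name is hypothetical, and reading 2002-01-01 as the harvest start date is an assumption about the values on the slide.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date=None, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one OAI set
    return base_url + "?" + urlencode(params)


url = list_records_url("http://memory.loc.gov/cgi-bin/oai", from_date="2002-01-01")
```

The XML returned for such a request would be stored locally and indexed with the SDARTS XML wrapper.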

Distributed Search over OAI
SDARTS treats OAI collections as simple, local XML databases
Exact content summaries are exported for OAI collections
Possible to build sophisticated distributed search over OAI using SDARTS
SDARTS content summary for an OAI collection (VT Electronic Thesis & Dissertation):
  number of documents = 2,948
  …
  study → 1,479
  thesis → 493
  cancer → 13
  basketball → 2

Conclusions
SDARTS can now extract rich content summaries from:
  Local text and XML databases
  Remote web databases
  OAI-compliant collections
SDARTS is now OAI-compliant
SDARTS allows easy integration of any OAI collection into SDARTS
SDARTS supports searching transparently over a wide range of heterogeneous collections
No programming required for any of the tasks

We are on the Web :-)
http://sdarts.cs.columbia.edu/
SDARTS executables and documentation
SDARTS source code with documentation
SDARTS web client
SDARTS database selection module
SDARTS-OAI interface tools
Sample SDARTS-compliant databases