Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University

Metasearching? Why? “Surface” Web vs. “Hidden” Web
- “Surface” Web
  - Link structure
  - Crawlable
- “Hidden” Web
  - Documents “hidden” in databases
  - No link structure
  - Search engines do not index them
  - Need to query each collection individually
9/21/2018 Columbia University Computer Science Dept.

Metasearching Challenges
- Select good databases for a given query
  - “Content summaries” of databases (frequencies of words)
- Evaluate the query at these databases
- Merge the results from these databases
  - Uniform interfaces
[Diagram: a Metasearcher over the Hidden Web, with one content summary per database: (wireless: 2,000, network: 8,000, ...), (wireless: 0, network: 10), (wireless: 5, network: 40). Labels: Existing Web Database, Non-indexed Documents, Relational Database / Library / etc.]
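
As a rough illustration of how content summaries can drive database selection, the sketch below ranks databases by a deliberately naive score: the sum of the summary frequencies of the query words. The scoring rule, the database names `db_a`, `db_b`, `db_c`, and the function `rank_databases` are assumptions made for this example; the slides do not specify a selection algorithm. The word counts are the ones shown in the diagram.

```python
from typing import Dict, List, Tuple

# Content summaries: word -> number of documents containing the word.
# Counts come from the slide; the database names are hypothetical.
SUMMARIES: Dict[str, Dict[str, int]] = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 0, "network": 10},
    "db_c": {"wireless": 5, "network": 40},
}


def rank_databases(query: List[str],
                   summaries: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Rank databases by the summed frequency of the query words.

    This is a naive score used only to illustrate the idea.
    """
    scores = {
        name: sum(summary.get(word, 0) for word in query)
        for name, summary in summaries.items()
    }
    # Highest score first; ties are broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_databases(["wireless", "network"], SUMMARIES)
```

A metasearcher would then evaluate the query only at the top-ranked databases and merge their results.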

Outline
- Background: SDARTS, SDLIP, STARTS
- Extracting content summaries from remote web databases
- Interfacing with Open Archives Initiative

SDARTS: SDLIP + STARTS
- NOT yet another protocol
- SDLIP interfaces
- STARTS metadata
[Diagram: a Metasearcher connected to three wrapped sources (grep, cat, select), each marked S and M. Legend: S = Search, M = Metadata]

STARTS: A Metasearching Protocol
- Defines:
  - Query language
  - Results format
  - Metadata for the collection
- Complements SDLIP for metasearching purposes
- Provides metadata for individual documents
- Provides content summaries for databases
PubMed content summary:
  number of documents = 3,868,552
  …
  cancer: 1,398,178
  heart: 281,506
  hepatitis: 23,481
  basketball: 907

SDARTS: The Toolkit
- SDARTS architecture makes new-wrapper implementation easy
- SDARTS toolkit includes reference implementations for common types of text databases:
  - Local text databases
  - Local XML databases
  - Remote web databases
- Customization requires just editing configuration files, no programming

SDARTS Content Summaries
- Detailed content summaries easily extracted from locally available (plain-text or XML) databases
- Detailed content summaries so far not available for remote web databases
  - No access to full contents
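
For a locally available collection, a content summary of the kind described here can be computed directly from the documents. The sketch below is a minimal illustration, not the SDARTS implementation: the function name `build_content_summary`, the tokenization rule, and the three sample documents are all assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def build_content_summary(documents: Iterable[str]) -> Dict[str, object]:
    """Count, for each word, the number of documents that contain it."""
    doc_freq: Counter = Counter()
    num_docs = 0
    for text in documents:
        num_docs += 1
        # A set ensures each word is counted once per document.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    return {"num_docs": num_docs, "doc_freq": dict(doc_freq)}


# Hypothetical three-document collection.
docs = [
    "Hepatitis affects the liver.",
    "Liver and kidney function tests.",
    "Heart disease and angina.",
]
summary = build_content_summary(docs)
```

This direct count is exactly what is unavailable for a remote web database, where the full contents cannot be read.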

Extracting Content Summaries from Remote Web Databases
- No direct access to remote documents
- Resort to document sampling:
  - Send queries to the database
  - Retrieve a representative document sample
  - Use the sample to create an approximation of the content summary
- Database selection algorithms work well even with approximate content summaries (VLDB 2002)

Topic-based Sampling: Training
- Start with a predefined hierarchy and associated, pre-classified documents
- Train rule-based document classifiers for each node
- The output is a set of rules like:
  - Root:
    ibm AND computers → Computers
    lung AND cancer → Health
    …
  - Health:
    hepatitis AND liver → Hepatitis
    angina → Heart

Topic-based Sampling: Probing
- Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
- Transform each rule into a query
- For each query:
  - Send query to database
  - Record number of matches
  - Retrieve top-k documents for query
- At the end of the round:
  - Analyze matches for each category
  - Choose category to focus on
- The result is a representative document sample
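
The probing rounds can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the authors' implementation: `ToyDatabase` is a hypothetical in-memory stand-in for a remote search interface, the rules are the example rules from the training slide, and "choose category to focus on" is reduced to picking the category with the most matches.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[Tuple[str, ...], str]  # (query terms, target category)

# Rules per hierarchy node, taken from the example rules on the training slide.
RULES: Dict[str, List[Rule]] = {
    "Root": [(("ibm", "computers"), "Computers"), (("lung", "cancer"), "Health")],
    "Health": [(("hepatitis", "liver"), "Hepatitis"), (("angina",), "Heart")],
}


class ToyDatabase:
    """Hypothetical stand-in for a remote database's search interface."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def search(self, terms: Tuple[str, ...], k: int) -> Tuple[int, List[str]]:
        """Return the number of matches and the first k matching documents."""
        hits = [d for d in self.documents
                if all(t in d.lower().split() for t in terms)]
        return len(hits), hits[:k]


def probe(db: ToyDatabase, rules: Dict[str, List[Rule]],
          k: int = 4) -> Tuple[List[str], List[str]]:
    """Run probing rounds; return the visited category path and the sample."""
    sample: List[str] = []
    seen: Set[str] = set()
    node = "Root"
    path = [node]
    while node in rules:                       # one round per visited node
        matches: Dict[str, int] = {}
        for terms, category in rules[node]:    # each rule becomes a query
            count, docs = db.search(terms, k)
            matches[category] = matches.get(category, 0) + count
            for doc in docs:                   # keep each document once
                if doc not in seen:
                    seen.add(doc)
                    sample.append(doc)
        best = max(matches, key=lambda c: matches[c])
        if matches[best] == 0:                 # nothing matched: stop descending
            break
        node = best                            # simplification: most matches wins
        path.append(node)
    return path, sample


db = ToyDatabase([
    "lung cancer screening study",
    "hepatitis and liver damage",
    "angina treatment trial",
    "liver transplant and hepatitis b",
])
path, sample = probe(db, RULES)
```

The match counts recorded per query are kept separately from the sample because they are the only direct evidence of absolute frequencies in the remote database.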

Sample Contains “Relative” Word Frequencies
- “Liver” appears in 200 out of 300 documents in sample
- “Kidney” appears in 100 out of 300 documents in sample
- “Hepatitis” appears in 30 out of 300 documents in sample
- Document frequencies in actual database?
  - Query “liver” returned 140,000 matches
  - Query “hepatitis” returned 20,000 matches
  - “kidney” was not a query probe…
- Can exploit number of matches from one-word queries

Adjusting Document Frequencies
- We know the absolute document frequency f of words from one-word queries
- We know the ranking r of words according to document frequency in the sample
- Mandelbrot’s formula connects word frequency f and ranking r
- We use curve-fitting to estimate the absolute frequency of all words in the sample
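
Mandelbrot's formula is commonly written as f = P (r + p)^(-B). The sketch below illustrates the curve-fitting idea under a simplifying assumption: it fixes p = 0, so the fit reduces to linear least squares in log-log space. The function names are hypothetical, and the ranks are computed over just the three example words rather than a full sample. The numbers are the ones from the “Relative” Word Frequencies slide.

```python
import math
from typing import List, Tuple


def fit_power_law(points: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Least-squares fit of log f = log P - B * log r; returns (P, B)."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(f) for _, f in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # equals -B
    intercept = mean_y - slope * mean_x   # equals log P
    return math.exp(intercept), -slope


def estimate_frequency(rank: int, P: float, B: float) -> float:
    """Estimated absolute document frequency of the word at this rank."""
    return P * rank ** (-B)


# Document frequencies in the 300-document sample.
sample_df = {"liver": 200, "kidney": 100, "hepatitis": 30}
# Absolute frequencies known from one-word query probes.
known = {"liver": 140_000, "hepatitis": 20_000}

# Rank words by sample frequency (1 = most frequent).
ordered = sorted(sample_df, key=lambda w: -sample_df[w])
ranks = {word: i + 1 for i, word in enumerate(ordered)}

P, B = fit_power_law([(ranks[w], f) for w, f in known.items()])
kidney_estimate = estimate_frequency(ranks["kidney"], P, B)
```

With only two known points the fitted curve passes through both, and the estimate for “kidney” falls between the two known frequencies. A real sample would supply many more probe words and a less trivial fit.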

Implementing Content-Summary Extraction in SDARTS Toolkit
- Implemented content-summary extraction module as a J2EE-compliant servlet
  - First, build an SDARTS wrapper for the remote web database
  - Then, trigger the extraction process to generate the content summary automatically
- Module customizable with any classification scheme
  - Toolkit provides a 72-node hierarchical scheme and associated classifiers
  - To add a new scheme, define the hierarchy and provide classifiers for the internal nodes

Fraction of PubMed Content Summary
  number of documents = 3,868,552
  …
  cancer: 1,398,178
  aids: 106,512
  heart: 281,506
  angina: 26,775
  hepatitis: 23,481
  basketball: 907
  cpu: 487
- Extracted automatically
- ~27,500 words in the extracted content summary
- Fewer than 200 queries sent
- Retrieved 4 documents per query
- The extracted content summary accurately represents size and contents of the database

Topic-based Sampling: Conclusions
- SDARTS now supports extraction of detailed content summaries from any database, local or remote
- Sophisticated database selection algorithms can now be implemented on top of SDARTS
- Implemented and available for download:
  - Database Selection Module
  - SDARTS Client with Database Selection

Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself; every man is a piece of the continent, a part of the main…” (John Donne)
- Export SDARTS metadata under OAI
- Access transparently any OAI collection through SDARTS
[Diagram labels: OAI Service Provider, SDARTS/SDLIP Server, OAI Data Provider, SDARTS Client]

Exporting SDARTS Metadata under OAI
- SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
  - Easy mapping to Dublin Core
- SDARTS also exports content summaries under OAI
  - Each SDARTS collection is mapped to an OAI set
  - We export the content summaries under OAI, as metadata about the set
Sample record:
  <PAPER>
    <TITLE>The threat of vancomycin resistance</TITLE>
    <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
    <FILENO>ajm_106_05_0489</FILENO>
    <APPEARED>
      <JRNL>American Journal of Medicine</JRNL>
      <VOL>106</VOL><ISS>5</ISS>
      <DATE>3 May </DATE>
      <YEAR>1999</YEAR>
    </APPEARED>
    <ABSTRACT> … </ABSTRACT>
    <BODY> … </BODY>
  </PAPER>
[Screenshot labels: COLUMBIA SDARTS Server; PubMed Publications Aides Medical Collection; NOAH: New York Online Access to Health; Cardiovascular Institute of the South; Columbia's DLI2 Medical Corpus; Harrisons Online]
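
To make the “easy mapping to Dublin Core” concrete, the sketch below converts the sample record into a dictionary keyed by Dublin Core element names. The particular field mapping in `FIELD_MAP` is an assumption chosen for illustration; the slides do not list the mapping SDARTS actually uses. The abstract and body are omitted, as they are elided on the slide.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical mapping from the sample record's elements to Dublin Core.
FIELD_MAP = {
    "TITLE": "dc:title",
    "AUTHORS": "dc:creator",
    "FILENO": "dc:identifier",
    "APPEARED/JRNL": "dc:source",
    "APPEARED/YEAR": "dc:date",
}

# The sample record from the slide, without the elided abstract and body.
RECORD = """
<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE>
    <YEAR>1999</YEAR>
  </APPEARED>
</PAPER>
"""


def to_dublin_core(xml_text: str) -> Dict[str, str]:
    """Extract the mapped fields from one record as Dublin Core name/value pairs."""
    root = ET.fromstring(xml_text.strip())
    record: Dict[str, str] = {}
    for path, dc_name in FIELD_MAP.items():
        element = root.find(path)
        if element is not None and element.text:
            record[dc_name] = element.text.strip()
    return record


dc_record = to_dublin_core(RECORD)
```

Keeping the mapping in a table rather than in code mirrors the toolkit's stated goal of customization through configuration.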

SDARTS OAI Server: Details
- Uses OCLC OAI Server
- Uses MySQL (via JDBC) to store OAI records
  - Records materialized after first request for space efficiency
- Distributed as WAR file
- Simple configuration: specify SDARTS/MySQL address
[Diagram labels: OAI Service Provider, SDARTS OAI Interface, JDBC, SDARTS Server, MySQL RDBMS]

Searching OAI Collections
- OAI is not designed for searching
  - Possible to restrict only “Date” and “Set”
- Need to search OAI collections
  - Users want to specify “Title”, “Author”, etc.
[Diagram labels: User; OAI Service Provider; OAI Data Provider (e.g., Library of Congress); query: Author = “F. Douglass”; “?”]

Harvesting and Searching OAI within SDARTS
- OAI exports metadata records in XML
- SDARTS can index and search XML collections
- Solution:
  - Harvest OAI records (by “Date”, “Set”)
  - Store records locally as XML documents
  - Use SDARTS XML wrapper to index them
- The OAI collection is searchable as an SDARTS XML database
[Diagram labels: OAI Data Provider (e.g., Library of Congress); Harvest OAI/XML records; SDARTS/SDLIP Server; Index OAI/XML records]
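
The harvesting step can be sketched as two small functions: one builds a harvest request restricted by date and set, and one splits a response into standalone XML records ready to be stored locally. This is a sketch assuming OAI-PMH 2.0 request syntax (`verb=ListRecords` with `metadataPrefix`, `from`, `until`, `set`); the base URL, the function names, and the heavily simplified sample response are hypothetical, and no network request is made.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     until: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build a ListRecords request, optionally restricted by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until:
        params["until"] = until
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def split_records(response_xml: str) -> Dict[str, str]:
    """Map each record identifier to that record serialized as standalone XML."""
    root = ET.fromstring(response_xml)
    records: Dict[str, str] = {}
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext("oai:header/oai:identifier",
                                     namespaces=OAI_NS)
        if identifier:
            records[identifier] = ET.tostring(record, encoding="unicode")
    return records


# Simplified, hypothetical response (real responses carry more elements).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata><title>First record</title></metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata><title>Second record</title></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai",
                       from_date="2002-01-01", set_spec="loc")
records = split_records(SAMPLE_RESPONSE)
```

Each value in `records` could then be written to its own file, giving a local XML collection that an XML wrapper can index.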

Adding an OAI Collection in SDARTS
[Screenshot; only the text “loc” survives]

Distributed Search over OAI
- SDARTS treats OAI collections as simple, local XML databases
- Exact content summaries are exported for OAI collections
- Possible to build sophisticated distributed search over OAI using SDARTS
SDARTS content summary for an OAI collection (VT Electronic Thesis & Dissertation):
  number of documents = 2,948
  …
  study: 1,479
  thesis: 493
  cancer: 13
  basketball: 2

Conclusions
- SDARTS can now extract rich content summaries from:
  - Local text and XML databases
  - Remote web databases
  - OAI-compliant collections
- SDARTS is now OAI-compliant
- SDARTS allows easy integration of any OAI collection into SDARTS
- SDARTS supports searching transparently over a wide range of heterogeneous collections
- No programming required for any of the tasks

We are on the Web :-) http://sdarts.cs.columbia.edu/
- SDARTS executables and documentation
- SDARTS source code with documentation
- SDARTS web client
- SDARTS database selection module
- SDARTS-OAI interface tools
- Sample SDARTS-compliant databases
