Summary of SDM ETC Kickoff for the Data Integration Task
Terence Critchlow, Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk

Summary of SDM ETC Kickoff for the Data Integration Task
Terence Critchlow, Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk, Tom Potok

People involved

People
- Terence Critchlow (LLNL)
- Calton Pu (GT)
- Ling Liu (GT)
- David Buttler (GT)
- Bertram Ludaescher (UCSD)
- Amarnath Gupta (UCSD)
- TBD:
  - Ph.D. student at Georgia Tech
  - Developer at UCSD
- Mladen Vouk / Tom Potok (NCSU / ORNL)

Commitment per institution
- LLNL
  - 0.25 (likely) to 1.0 FTE
- Georgia Tech
  - 2 Ph.D. students
  - X months of Calton's time
  - Y months of Ling's time
- UCSD
  - 1 FTE
  - 1 month of Bertram's time
  - 1 month of Gupta's time
- Agent team
  - 2-4 months over the course of the year

Application ties
- Primary domain: bioinformatics
- Secondary domains:
  - Material science
  - Air / water quality
- Scientists (early adopters)
  - Matt Coleman (LLNL), contacted by Terence
  - Allen Christian (LLNL), contacted by Terence
  - Phil Bourne (PDB), contacted by Bertram / Gupta

Use Case 1: Finding out everything about a sequence
- Bob starts with one or several DNA or protein sequences that he wants to analyze
  - OR: Bob finds protein or gene sequences of interest by querying databases / web sites for metabolic pathways / cell signaling pathways (e.g., KEGG)
  - OR: Bob looks at a database of microarray experiments and chooses those genes that exhibit specified patterns of co-occurrence (which subsets of genes "go hand in hand" across a large number of experiments)
- The relevant sequences are submitted to one or more sequence databases for a BLAST search
- The homologous sequences found in the searched database(s) are
  - directly returned to the user, sorted by score
  - OR: post-processed by the mediator (duplicate elimination, groupings, links to additional contextual data)
- The resulting sequences can be queried for their associated information
- Bob can use these sequences for new similarity searches
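The post-processing step in this use case (duplicate elimination, then sorting by score) can be sketched roughly as follows. This is only an illustration: the `Hit` record, the `merge_hits` function, the accession values, and the rule that two hits with the same accession are duplicates are all assumptions, not part of the planned mediator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homologous sequence reported by a source (hypothetical record layout)."""
    source: str      # database that reported the hit
    accession: str   # sequence identifier, assumed comparable across sources
    score: float     # similarity score; higher is better


def merge_hits(hits_per_source):
    """Combine hit lists, keep the best-scoring copy of each accession, sort by score."""
    best = {}
    for hits in hits_per_source:
        for hit in hits:
            current = best.get(hit.accession)
            if current is None or hit.score > current.score:
                best[hit.accession] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up results from two sources; "P12345" is reported by both.
ncbi_hits = [Hit("NCBI", "P12345", 310.0), Hit("NCBI", "Q67890", 120.5)]
pdb_hits = [Hit("PDB", "P12345", 295.0), Hit("PDB", "1ABC", 200.0)]

merged = merge_hits([ncbi_hits, pdb_hits])
for hit in merged:
    print(hit.accession, hit.score, hit.source)
```

A real mediator would likely need a more careful notion of "duplicate" than identical accession strings, since different databases use different identifiers for the same sequence.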

Use Case 1: Additional scenarios
- Helpful features for users
  - Multiple sequences entered through a single file
  - Ability to tie in other programs to preprocess data before passing it to wrappers / mediator
- Follow-up searches may be more than just BLAST searches
  - Select / project / join queries through the interface
  - Tie in other tools such as RasMol
  - Other types of search such as PHI-BLAST, PSI-BLAST, or other structural similarity searches

Data Integration Architecture
[Figure: architecture diagram. Labeled components: API; Query Dispatch and Collection (QDaC); Integration component / KB-Mediator (KBM); CM Wrappers; XML Wrappers over sources (df, PDB, VIPAR, Medline); Source / Agent MetaData Registry; XWRAP Wrapper Generator; External Program. Annotations: XQuery interface (subsets, e.g. select/project only); the external program, if invoked, pre-processes query parameters and post-processes results.]

Architecture comments
- Communication protocol:
  - Use agent technology to communicate between components
    - Don't use full capabilities when on the same machine
    - Between QDaC and wrappers, QDaC and mediator, mediator and CMs, CMs and wrappers
    - NOT expected between wrappers and source
- Embedded representation:
  - XML sources are queried using a subset of XQuery (fragments)
    - Primarily concerned with selection and projection, not join
  - Query results are returned in XML
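To make "selection and projection, not join" concrete, here is a minimal sketch of that query subset applied to an XML document, with the result returned as XML. It uses Python's standard `xml.etree.ElementTree` rather than an XQuery engine, and the element names (`hits`, `hit`, `accession`, `score`, `organism`) and the `select_project` helper are invented for illustration. The call at the end corresponds roughly to an XQuery expression such as `for $h in /hits/hit where $h/score >= 200 return <hit>{$h/accession, $h/score}</hit>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as a wrapper might export it; the element names are assumptions.
SOURCE_XML = """
<hits>
  <hit><accession>P12345</accession><score>310.0</score><organism>human</organism></hit>
  <hit><accession>Q67890</accession><score>120.5</score><organism>mouse</organism></hit>
  <hit><accession>1ABC</accession><score>200.0</score><organism>human</organism></hit>
</hits>
"""


def select_project(xml_text, predicate, fields):
    """Keep <hit> elements that satisfy predicate (selection) and copy only the
    named child elements (projection). The result is serialized back to XML."""
    root = ET.fromstring(xml_text.strip())
    result = ET.Element("result")
    for hit in root.findall("hit"):
        if predicate(hit):
            row = ET.SubElement(result, "hit")
            for name in fields:
                child = hit.find(name)
                if child is not None:
                    ET.SubElement(row, name).text = child.text
    return ET.tostring(result, encoding="unicode")


answer = select_project(
    SOURCE_XML,
    predicate=lambda h: float(h.findtext("score")) >= 200.0,
    fields=["accession", "score"],
)
print(answer)
```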

Architecture comments (continued)
- Meta-data repository (= metadata server)
  - Contains:
    - Location, schema
    - Query capabilities (BLAST, keyword, XPath) of sources
  - May be duplicated / shared between QDaC and KBM
  - Eventually may be treated as an agent
- External programs
  - Will be included as preprocessing steps
  - May need wrappers to handle translations properly
  - Will be tied in to interface where possible
  - Gives users access to tools they need / want / are familiar with
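One way to picture the meta-data repository is as a small registry keyed by source name, holding the location, schema, and query capabilities listed above. The sketch below is an assumption about shape only: the class names, field names, example URLs, and schema fields are hypothetical and are not taken from the actual design.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Metadata kept for one wrapped source (hypothetical layout)."""
    name: str
    location: str                      # where the wrapper can be reached
    schema: list                       # names of the fields the source exports
    capabilities: set = field(default_factory=set)  # e.g. {"blast", "keyword", "xpath"}


class MetadataRegistry:
    """In-memory stand-in for the metadata server."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def sources_supporting(self, capability):
        """Names of all registered sources that advertise the given capability."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if capability in entry.capabilities
        )


registry = MetadataRegistry()
# The locations below are placeholders, not real service endpoints.
registry.register(SourceEntry("NCBI", "http://example.org/ncbi-wrapper",
                              ["accession", "score"], {"blast", "keyword"}))
registry.register(SourceEntry("PDB", "http://example.org/pdb-wrapper",
                              ["accession", "score", "structure"], {"blast", "xpath"}))
registry.register(SourceEntry("Medline", "http://example.org/medline-wrapper",
                              ["title", "abstract"], {"keyword"}))

print(registry.sources_supporting("blast"))
```

Because the registry is a plain lookup structure, duplicating or sharing it between QDaC and KBM amounts to giving both components the same entries.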

Architecture comments (continued)
- Expect most wrappers to be generated by XWrap in practice, but it shouldn't matter as long as they follow the specified protocol and representation
  - VIPAR used to wrap publication sources
  - Simple SQL wrapper for direct database access
- Definitions:
  - CM (conceptual mapping): a wrapper that translates source-specific XML into

Year 1 deliverables
- Send XQuery command to BLAST sources, combine results, and return to user interface
- Interact with at least 4 sources
  - Integration component will have at least 2 sources
  - QDaC will directly query NCBI and at least one other
- Operate QDaC and mediator in a distributed environment
  - Interface / QDaC at LLNL and mediator at UCSD
- Have agent stubs at UCSD and LLNL passing text strings within 3 months

Detailed tasks
1. Interface (LLNL)
   A. Extended to handle BLAST against new sources
      - Some of which are not integrated
2. QDaC (LLNL)
   A. Identify available wrappers from meta-data
      - This includes the SDSC component
   B. Query wrappers using XQuery
   C. Collect and sort responses
   D. Adopt agent protocol

Detailed tasks (continued)
3. XWrap (GT)
   A. Accept XPath / XQuery input
   B. Handle complex BLAST interfaces
   C. Adopt agent protocol
4. Mediator (UCSD)
   A. Model of pathways, gene and protein expressions ==> ontology to be used for driving BLAST queries and interpreting their results
   B. Accept XQuery queries
   C. Identify available sources from meta-data
   D. Modify CM wrappers to generate XQuery commands
5. Agent technology (ORNL, LLNL, UCSD)
   A. Use VIPAR to wrap Medline database
   B. Use protocols to communicate between LLNL and SDSC components
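The QDaC tasks (query the available wrappers, then collect and sort the responses) could look roughly like the sketch below. It is a local stand-in only: the wrapper functions return canned data, the query string is a placeholder, and threads from Python's standard `concurrent.futures` module take the place of the agent protocol, which is not specified here.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in wrappers returning canned (accession, score) pairs. Real wrappers
# would accept an XQuery fragment and contact a remote source.
def ncbi_wrapper(query):
    return [("P12345", 310.0), ("Q67890", 120.5)]


def pdb_wrapper(query):
    return [("1ABC", 200.0)]


WRAPPERS = {"NCBI": ncbi_wrapper, "PDB": pdb_wrapper}


def dispatch(query, wrappers):
    """Send the query to every wrapper concurrently, then collect the
    responses and sort them by descending score."""
    responses = []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in wrappers.items()}
        for name, future in futures.items():
            for accession, score in future.result():
                responses.append((accession, score, name))
    responses.sort(key=lambda r: r[1], reverse=True)
    return responses


results = dispatch("placeholder query", WRAPPERS)
for accession, score, source in results:
    print(accession, score, source)
```

In the planned system the `WRAPPERS` table would presumably be filled from the meta-data rather than hard-coded, per task 2A.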

Administrative
- Reports
  - Quarterly reports
    - To be collected by Terence, (possibly) summarized, and forwarded on to Arie
    - Short, in bulleted form (Word file or plain text preferred)
- Center-wide communications
  - Telecon on the 1st Monday of the month, 11:00-12:00 PST
    - It is OK to miss this
  - Semi-annual meetings
    - Next at ORNL in mid-March
  - Center web site will point to individual task sites
  - Shared CVS repository at NC State
    - Primarily for major releases / sharing code between tasks

Administrative (continued)
- Advisory committee
  - Potential names from bioinformatics area
    - Carole Goble (Univ of Manchester), Tom Slezak (LLNL), ???
  - Unclear who pays travel for members
  - This is for us, so they will not be generating reports

Task specific
- Mail list
  - For our task ONLY; is being set up
  - Will be archived
- Site contacts
  - Terence (LLNL)
  - Bertram (UCSD)
  - Calton (GT)
  - Tom (Agents)
- Web site
  - Being set up at GT
- Use main CVS repository for major releases
- Code sharing option 1
  - Task-only CVS repository for day-to-day work
  - Unlikely LLNL could host this service
- Code sharing option 2
  - Site-specific CVS repositories for day-to-day work
  - Alexandria repository for inter-task code sharing
    - casc.llnl.gov/alexandria/
    - Disadvantage: tar-balls
    - Advantage: we don't all need an account on the repository machine