1
Summary of SDM ETC Kickoff for the Data Integration Task
Terence Critchlow, Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk, Tom Potok
2
People involved
People
- Terence Critchlow (LLNL)
- Calton Pu (GT)
- Ling Liu (GT)
- David Buttler (GT)
- Bertram Ludaescher (UCSD)
- Amarnath Gupta (UCSD)
- TBD:
  - Ph.D. student at Georgia Tech
  - Developer at UCSD
- Mladen Vouk / Tom Potok (NCSU / ORNL)
Commitment per institution
- LLNL: 0.25 (likely) to 1.0 FTE
- Georgia Tech: 2 Ph.D. students, X months of Calton's time, Y months of Ling's time
- UCSD: 1 FTE, 1 month of Bertram's time, 1 month of Gupta's time
- Agent team: 2-4 months over the course of the year
3
Application ties
- Primary domain: bioinformatics
- Secondary domains:
  - Material science
  - Air / water quality
- Scientists (early adopters)
  - Matt Coleman (LLNL), contacted by Terence
  - Allen Christian (LLNL), contacted by Terence
  - Phil Bourne (PDB), contacted by Bertram / Gupta
4
Use Case 1: Finding out everything about a sequence
- Bob starts with one or several DNA or protein sequences that he wants to analyze
  - OR: Bob finds protein or gene sequences of interest by querying databases / web sites for metabolic pathways / cell signaling pathways (e.g., KEGG)
  - OR: Bob looks at a database of microarray experiments and chooses those genes that exhibit specified patterns of co-occurrence (which subsets of genes “go hand in hand” across a large number of experiments)
- The relevant sequences are submitted to one or more sequence databases for BLAST search
- The homologous sequences found in the searched database(s) are returned directly to the user, sorted by score
  - OR: post-processed by the mediator (duplicate elimination, groupings, links to additional contextual data)
- The resulting sequences can be queried for their associated information
- Bob can use these sequences for new similarity searches
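The mediator post-processing mentioned on this slide (duplicate elimination, then sorting by score) might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `Hit` record, its field names, and the sample accessions and scores are hypothetical, and keeping the best-scoring copy of a duplicate is only one possible policy. Scores reported by different sources are not necessarily comparable, which a real mediator would have to address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    accession: str  # hypothetical identifier of the matched sequence
    source: str     # name of the database that returned the hit
    score: float    # similarity score reported by that source


def merge_hits(result_sets):
    """Merge hits from several sources.

    Assumed policy: two hits are duplicates when they share an accession,
    and the higher-scoring copy is kept. The merged list is sorted by
    score, best first.
    """
    best = {}
    for hits in result_sets:
        for hit in hits:
            current = best.get(hit.accession)
            if current is None or hit.score > current.score:
                best[hit.accession] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up results from two hypothetical sources.
source_a = [Hit("SEQ1", "source_a", 210.0), Hit("SEQ2", "source_a", 95.5)]
source_b = [Hit("SEQ1", "source_b", 198.0), Hit("SEQ3", "source_b", 120.0)]

merged = merge_hits([source_a, source_b])
for hit in merged:
    print(hit.accession, hit.source, hit.score)
```

Grouping and links to contextual data, also named on the slide, are left out to keep the sketch short.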
5
Use Case 1: Additional scenarios
- Helpful features for users
  - Multiple sequences entered through a single file
  - Ability to tie in other programs to preprocess data before passing it to wrappers / mediator
- Follow-up searches may be more than just BLAST searches
  - Selection / projection / join queries through the interface
  - Tie in other tools such as RasMol
  - Other types of search such as PHI-BLAST, PSI-BLAST, or other structural similarity searches
6
Data Integration Architecture
[Architecture diagram; only its labels survive. Components: XQuery interface; External Program; Query Dispatch and Collection (QDaC); Integration component / KB-Mediator (KBM); CM Wrappers; XML Wrappers; PDB; Medline; VIPAR; API; Source / Agent MetaData Registry; XWRAP Wrapper Generator. Annotations: XQuery (subsets, e.g. Sel/Proj); select/project only; external program, if invoked, pre-processes query parameters and post-processes results.]
7
Architecture comments
- Communication protocol: use agent technology to communicate between components
  - Don't use full capabilities when on the same machine
  - Between QDaC and wrappers, QDaC and mediator, mediator and CMs, CMs and wrappers
  - NOT expected between wrappers and source
- Embedded representation:
  - XML sources are queried using a subset of XQuery (fragments)
  - Primarily concerned with selection and projection, not join
  - Query results are returned in XML
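To make "selection and projection, with results returned in XML" concrete, here is a small Python sketch. It does not use XQuery: the limited XPath support in the standard library's `xml.etree.ElementTree` stands in for the XQuery subset, and the element names (`hits`, `hit`, `id`, `organism`, `score`) and the sample data are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as a wrapper might expose it; not a real source schema.
SOURCE_XML = """
<hits>
  <hit><id>SEQ1</id><organism>human</organism><score>210</score></hit>
  <hit><id>SEQ2</id><organism>mouse</organism><score>95</score></hit>
  <hit><id>SEQ3</id><organism>human</organism><score>120</score></hit>
</hits>
"""


def select_project(xml_text, path, fields):
    """Select records matching `path`, keep only `fields`, return XML text."""
    root = ET.fromstring(xml_text)
    result = ET.Element("result")
    for record in root.findall(path):            # selection
        row = ET.SubElement(result, record.tag)
        for name in fields:                      # projection
            child = record.find(name)
            if child is not None:
                ET.SubElement(row, name).text = child.text
    return ET.tostring(result, encoding="unicode")


# Select human hits and project onto id and score.
answer = select_project(SOURCE_XML, "./hit[organism='human']", ["id", "score"])
print(answer)
```

A join would need to correlate records across two sources, which is the part the slide leaves out of the wrapper-level query subset.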
8
Architecture comments
- Meta-data repository (= metadata server)
  - Contains: location, schema, and query capabilities (blast, keyword, XPath) of sources
  - May be duplicated / shared between QDaC and KBM
  - Eventually may be treated as an agent
- External programs
  - Will be included as preprocessing steps
  - May need wrappers to handle translations properly
  - Will be tied in to interface where possible
  - Gives users access to tools they need / want / are familiar with
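A metadata repository holding location, schema, and query capabilities per source could be sketched as below. This is an illustration under assumptions, not the project's design: the class names, the `wrapper://` location strings, the schema strings, and the source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEntry:
    name: str                # hypothetical source name
    location: str            # where the wrapper for this source can be reached
    schema: str              # description of the XML the wrapper returns
    capabilities: frozenset  # e.g. {"blast", "keyword", "xpath"}


class MetadataRegistry:
    """In-memory stand-in for the meta-data repository."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def find_by_capability(self, capability):
        """Return the names of sources that support a query capability."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if capability in entry.capabilities
        )


registry = MetadataRegistry()
registry.register(SourceEntry(
    "sequence_db", "wrapper://host-a/sequence_db",
    "hit(id, score)", frozenset({"blast", "keyword"})))
registry.register(SourceEntry(
    "structure_db", "wrapper://host-b/structure_db",
    "entry(id, title)", frozenset({"blast", "xpath"})))
registry.register(SourceEntry(
    "publication_db", "wrapper://host-c/publication_db",
    "article(id, title)", frozenset({"keyword"})))

print(registry.find_by_capability("blast"))
```

A component such as QDaC could call `find_by_capability("blast")` to decide which wrappers should receive a BLAST request.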
9
Architecture comments
- Expect most wrappers to be generated by XWrap in practice, but it shouldn't matter as long as they follow the specified protocol and representation
  - VIPAR used to wrap publication sources
  - Simple SQL wrapper for direct database access
- Definitions:
  - CM (conceptual mapping): a wrapper that translates source-specific XML into
10
Year 1 deliverables
- Send XQuery command to BLAST sources, combine results, and return to user interface
- Interact with at least 4 sources
  - Integration component will have at least 2 sources
  - QDaC will directly query NCBI and at least one other
- Operate QDaC and mediator in a distributed environment
  - Interface / QDaC at LLNL and mediator at UCSD
  - Have agent stubs at UCSD and LLNL passing text strings within 3 months
11
Detailed tasks
1. Interface (LLNL)
   A. Extended to handle BLAST against new sources
      Some of which are not integrated
2. QDaC (LLNL)
   A. Identify available wrappers from meta-data
      This includes the SDSC component
   B. Query wrappers using XQuery
   C. Collect and sort responses
   D. Adopt agent protocol
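The QDaC steps listed above (query the available wrappers, collect the responses, sort them) can be sketched in Python as follows. The wrappers here are plain local functions with made-up names and data, the query string is a placeholder rather than real XQuery, and tolerating a failed source by recording it and continuing is an assumption of this sketch, not something the slides specify. The agent protocol is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub wrappers standing in for remote sources; each returns (id, score) rows.
def wrapper_a(query):
    return [("SEQ1", 210.0), ("SEQ2", 95.5)]


def wrapper_b(query):
    return [("SEQ3", 120.0)]


def wrapper_down(query):
    raise ConnectionError("source unavailable")


WRAPPERS = {
    "source_a": wrapper_a,
    "source_b": wrapper_b,
    "source_down": wrapper_down,
}


def dispatch_and_collect(query, wrappers):
    """Send `query` to every wrapper in parallel, then collect and sort.

    Returns (rows, failed): rows are (id, score, source) tuples sorted by
    score, best first; failed lists the sources that raised an error.
    """
    collected, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in wrappers.items()}
        for name, future in futures.items():
            try:
                rows = future.result()
            except Exception:
                failed.append(name)  # assumed policy: skip the failed source
                continue
            collected.extend((seq_id, score, name) for seq_id, score in rows)
    collected.sort(key=lambda row: row[1], reverse=True)
    return collected, failed


hits, failed_sources = dispatch_and_collect("placeholder query", WRAPPERS)
print(hits)
print(failed_sources)
```

In the architecture described in this deck, the wrapper list would come from the meta-data repository rather than a hard-coded dictionary.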
12
Detailed tasks
3. XWrap (GT)
   A. Accept XPath/XQuery input
   B. Handle complex BLAST interfaces
   C. Adopt agent protocol
4. Mediator (UCSD)
   A. Model of pathways, gene and protein expressions ==> ontology to be used for driving BLAST queries and interpreting their results
   B. Accept XQuery queries
   C. Identify available sources from meta-data
   D. Modify CM wrappers to generate XQuery commands
5. Agent technology (ORNL, LLNL, UCSD)
   A. Use VIPAR to wrap Medline database
   B. Use protocols to communicate between LLNL and SDSC components
13
Administrative
- Reports
  - Quarterly reports to be collected by Terence, (possibly) summarized, and forwarded on to Arie
  - Short, bulleted form (Word file or plain text preferred)
- Center-wide communications
  - Telecon on the 1st Monday of the month, 11:00-12:00 PST
    It is OK to miss this
  - Semi-annual meetings; next at ORNL in mid-March
  - Center web site will point to individual task sites
  - Shared CVS repository at NC State
    Primarily for major releases / sharing code between tasks
14
Administrative
- Advisory committee
  - Potential names from bioinformatics area: Carole Goble (Univ of Manchester), Tom Slezak (LLNL), ???
  - Unclear who pays travel for members
  - This is for us, so they will not be generating reports
15
Task specific
- Mail list
  - For our task ONLY
  - sdmctr-integrate@llnl.gov is being set up
  - Will be archived
- Site contacts
  - Terence (LLNL)
  - Bertram (UCSD)
  - Calton (GT)
  - Tom (Agents)
- Web site
  - Being set up at GT
- Use main CVS repository for major releases
- Code sharing option 1
  - Task-only CVS repository for day-to-day work
  - Unlikely LLNL could host this service
- Code sharing option 2
  - Site-specific CVS repositories for day-to-day work
  - Alexandria repository for inter-task code sharing: https://www-casc.llnl.gov/alexandria/
  - Disadv: tar-balls
  - Adv: we don't all need an account on the repository machine