1 Arie Shoshani, LBNL SDM center Scientific Data Management Center (Integrated Software Infrastructure Center – ISIC) Arie Shoshani All Hands Meeting March.


Presentation transcript:

1 Arie Shoshani, LBNL SDM center Scientific Data Management Center (Integrated Software Infrastructure Center – ISIC) Arie Shoshani All Hands Meeting March 26-27,


3 Original Goals and Framework
Coordinated framework for the unification, development, deployment, and reuse of scientific data management software
Framework
  - 4 areas (+ "glue"): very large, distributed, heterogeneous, data mining (+ agent technology)
  - 4 tier levels: storage, file, dataset, federated data

4 Task Diagram
[Diagram: tasks arranged by area and tier level; the labels are listed below.]
Areas
  1) Storage and retrieval of very large datasets
  2) Access optimization of distributed data
  3) Data mining and discovery of access patterns
  4) Distributed, heterogeneous data access
  5) Agent technology
Tier levels
  a) Storage Level
  b) File Level
  c) Dataset Level
  d) Dataset Federation Level
Tasks
  - Parallel I/O: improving parallel access from clusters (ANL, NWU)
  - MPI I/O: implementation based on file-level hints (ANL, NWU)
  - Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
  - Knowledge-based federation of heterogeneous databases (SDSC)
  - Low level API for grid I/O (ANL)
  - Optimization of low-level data storage, retrieval and transport (ORNL) [Grid Enabling Technology]
  - Analysis of application-level query patterns (LLNL, NWU)
  - Optimizing shared access to tertiary storage (LBNL, ORNL)
  - High-dimensional indexing techniques (LBNL)
  - Enabling communication among tools and data (ORNL, NCSU)
  - Multi-agent high-dimensional cluster analysis (ORNL)
  - Adaptive file caching in a distributed system (LBNL)
  - Dimension reduction and sampling (LLNL, LBNL)

5 Scientific Data Management ISIC
[Figure labels: Scientific Simulations & experiments; Tapes, Disks (Petabytes, Terabytes); Data Manipulation; Scientific Analysis & Discovery]
Data manipulation
  - Getting files from tape archive
  - Extracting subset of data from files
  - Reformatting data
  - Getting data from heterogeneous, distributed systems
  - Moving data over the network
Current: data manipulation ~80% of time, scientific analysis & discovery ~20% of time
Goal, using SDM-ISIC technology: data manipulation ~20% of time, scientific analysis & discovery ~80% of time
Application areas: Climate Modeling, Astrophysics, Genomics and Proteomics, High Energy Physics
SDM-ISIC technology
  - Optimizing shared access from mass storage systems
  - Metadata and knowledge-based federations
  - API for Grid I/O
  - High-dimensional cluster analysis
  - High-dimensional indexing
  - Adaptive file caching
  - Agents
  - …
DOE Labs: ANL, LBNL, LLNL, ORNL
Universities: GTech, NCSU, NWU, SDSC
Goals: optimize and simplify
  - access to very large datasets
  - access to distributed data
  - access of heterogeneous data
  - data mining of very large datasets

6 Benefits to Applications
Efficiency
  - Example: by removing I/O bottlenecks, matching storage structures to the application
Effectiveness
  - Example: by making access to data from tertiary storage or various sites on the data grid "transparent", more effective data exploration is possible
New algorithms
  - Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
Enabling ad-hoc exploration of data
  - Example: by enabling a "run and render" capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation

7 How to execute plan?
Executive Committee
  - Made of area leaders
Organize into projects
  - Led by area leaders
  - Common theme
  - Multiple tasks combine into common goal
  - All tasks covered (some in more than one project)
  - Initially focus on one primary application area (more is better)
  - Focus on one (or more) application scientist contacts
  - Focus on specific scenarios that represent real needs
Conference calls
  - Every Monday
  - Cycle on projects P1-P4
  - Open to all (Arie & Ekow attend all)
Quarterly reports
Half-yearly all-hands

8 Organization of Projects: P1, P2, P3, P4
[Task diagram repeated from slide 4: same areas, tier levels, and tasks.]

9 Projects and Primary Application Areas
Organized ourselves into 4 projects
(P1) Heterogeneous Data Integration (Biology)
  - LLNL, SDSC, GATECH, NCSU, ORNL
(P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
  - LLNL, ORNL, LBNL
(P3) Efficient Access from Large Datasets (HENP, Combustion)
  - LBNL, ORNL
(P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
  - ANL, NWU, LLNL

10 Projects and Primary Application Areas
Organized ourselves into 4 projects
(P1) Heterogeneous Data Integration (Biology)
  - LLNL: Terence
  - SDSC: Amarnath, Bertram, Ilkay
  - GATECH: Ling, Calton + students
  - NCSU: Mladen + students
  - ORNL: Tom
(P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
  - LLNL: Chandrika, Ghaleb, Imola
  - ORNL: Nagiza, George, Tom
  - LBNL: Ekow

11 Projects and Primary Application Areas
Organized ourselves into 4 projects
(P3) Efficient Access from Large Datasets (HENP, Combustion)
  - LBNL: John, Ekow, Arie + postdoc
  - ORNL: Randy, Dan
(P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
  - ANL: Bill, Rob, Rajeev
  - NWU: Alok, Wei-keng + students
  - LLNL: Ghaleb
Area leader at large
  - Tom

12 Focus on real needs
Selected specific short-term goals & scenarios
(P1) Heterogeneous Data Integration (Biology)
  - Microarray analysis workflow scenario
(P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
  - "Run and Render" scenario for Astrophysics
  - Dimensionality reduction for Climate model
(P3) Efficient Access from Large Datasets (HENP)
  - STAR analysis framework
(P4) Parallel Disk Access & Grid-IO (Astrophysics, Climate)
  - FLASH codes for Astrophysics
  - NetCDF using MPI-IO for Climate Modeling & Fusion

13 Application Scientists Contacts
Close collaboration with individuals
  - Matt Coleman - LLNL (Biology)
  - Tony Mezzacappa - ORNL (Astrophysics)
  - Ben Santer - LLNL, John Drake - ORNL (Climate)
  - Doug Olson - LBNL, Wei-Ming Zhang - Kent (HENP)
  - Wendy Koegler - Sandia L. (Combustion)
  - Mike Papka - ANL (Astrophysics Vis)
  - Mike Zingale - U of Chicago (Astrophysics)
  - John Michalakes - NCAR (Climate)

14 Organization of Meeting
First day
  Applications perspective on data management needs
    - Explain why there is a need
    - Say what hurts the most
  Technical details of current work and existing software
    - By project
    - Talks led by Area Leaders
Second day
  Discuss and develop plans: 4 breakout sessions
    - Specific technical goals in next half year
    - SDM-ISIC people involved
    - Application people involved
    - Estimated schedule
    - Longer term projections (2-3 years)
    - Identify potential new applications (future focus)
  Planning
    - Conference calls, reporting
    - Intellectual property
    - CVS repositories
    - Future all-hands, September

15 Agenda - Morning
Day 1, March 26
  8:00  Introduction and opening remarks (Arie Shoshani)
  8:15  Comments by DOE Program Manager (John Van Rosendale)
  8:30  Astrophysics Perspective (Tony Mezzacappa, ORNL)
  9:15  Climate Perspective (John Drake, ORNL)
  10:00-10:15  Break
  10:15  HEP Perspective (Doug Olson, LBNL)
  11:00  Biology Perspective (Dave Nelson, LLNL)
  11:45  Putting software into production (Randy Burris, ORNL)
  12:00  Lunch

16 Agenda - Afternoon
1:00 PM  (P1) Heterogeneous Data Access
  Area Leader: Terence Critchlow
  - Supporting Heterogeneous Data Access in Genomics (presenter: Terence Critchlow)
  - Context-sensitive Service Composition for Support of Scientific Workflows (presenter: Mladen A. Vouk)
  - XWRAPComposer: A wrapper generation system for Integrating Bioinformatics Data Sources (presenter: Ling Liu)
  - Constructing Workflows by Integrating Interactive Information Sources (presenters: Amarnath Gupta & Ilkay Altintas)
2:00 PM  (P2) Data Mining and Access Pattern Discovery
  Area Leader: Nagiza Samatova
  - ASPECT: Adaptable Simulation Product Exploration and Control Toolkit (presenter: Nagiza Samatova)
  - Dimension Reduction and Sampling (presenter: Imola Fodor)
  - Discovery of Access Patterns to Scientific Simulation Data (presenter: Ghaleb Abdulla)
3:30 PM  (P3) Efficient Access from Large Datasets
  Area Leader: Arie Shoshani
  - Supporting Ad-hoc Data Exploration for Large Scientific Databases (presenter: Arie Shoshani)
  - Efficient Bitmap Indexing Techniques for Very Large Datasets (presenter: John Wu)
  - Shared Disk File Caching Taking into Account Delays in Space Reservations, Transfer, and Processing (presenter: Ekow Otoo)
  - Optimizing Shared Access to Tertiary Storage (presenter: Randy Burris)
4:30 PM  (P4) Parallel Disk Access & Grid-IO
  Area Leaders: Bill Gropp and Alok Choudhary
  - Parallel and Grid I/O Infrastructure (presenter: Rob Ross)
  - Enabling High Performance Application I/O (presenter: Wei-keng Liao)
5:30  Comments from application people (1 hour, free-form discussion)

17 Agenda - Day 2
  8:00  Welcome and logistics
  8:30  Recap and planning
  9:30  Project breakout meetings (2 hours)
    - Specific technical goals in next half year
    - SDM-ISIC people involved
    - Application people involved
    - Estimated schedule
    - Longer term projections (2-3 years)
    - Identify potential new applications (future focus)
  Lunch
  1:00  Project breakout meetings (2 hours)
  3:00  Summary of meetings (2 hours, 30 min per project)
  5:00  Conclusion and planning