1 Distributed Software Systems: Cyberinfrastructure and Geoinformatics Chaitan Baru San Diego Supercomputer Center.


2 Hardware Integrated Cyberinfrastructure System Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee Middleware Services Development Tools & Libraries Applications Geosciences Environmental Sciences Neurosciences High Energy Physics … Domain-specific Cybertools (software) Shared Cybertools (software) Distributed Resources (computation, storage, communication, etc.) Education and Training Discovery & Innovation

3 Community Cyberinfrastructure Projects Middleware Services Development Tools & Libraries Distributed Computing, Instruments and Data Resources Friendly Work-Facilitating Portals Authentication - Authorization - Auditing - Workflows - Visualization - Analysis Biomedical Informatics (BIRN) High Energy Physics (GriPhyN) Geosciences (GEON) Ecological Observatories (NEON) Earthquake Engineering (NEES) Ocean Observing (ORION) Hardware Adapted from: Prof. Mark Ellisman, UC San Diego Shared Tools Science Domains Your Specific Tools & User Apps.

4 Data, Tools, & Computation Data –Field observations –Laboratory analyses –Sensor-based data (land, airborne, satellite) Tools –QA/QC, simple transformations and analyses –Complex models Computation –Community codes –Access to high-performance computing –Data Intensive Computing

5 Variety of Geoinformatics Efforts Data collection –Digital data collection in the field –When does it become “cyberinfrastructure”? Database curation –E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc. –When does it become “tools” and “community codes”? Software Development –Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, … –Community codes: SCEC-CME, CIG, …

6 Variety of Geoinformatics Efforts High Performance Computing –LiDAR data management –Seismic analyses –Petascale initiative Data Integration –E.g. CUAHSI HIS –Also, a pressing need in projects like EarthScope

7 Cyberinfrastructure To provide access to all of these “resources” and support “interoperability” among them Cyberinfrastructure: The Common Platform Across Distributed Projects Data Collection Data Management And Curation Tool Development Modeling and Integration

8 Example: USArray Data Flow Deploy field sensor arrays –Across US Collect data from sensor arrays and perform QA/QC –One of the sites is SIO, San Diego Archive data for community access –IRIS, Seattle EarthScope/USArray: Single project, multiple participants.

9 Example: LiDAR Workflow (Courtesy: Chris Crosby, ASU; image: D. Harding, NASA) Survey → Point Cloud (x, y, z, …) → Interpolate / Grid → Analyze / “Do Science” Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, …

10 GEON Cyberinfrastructure Funded by NSF IT Research program Multi-institution collaboration between IT and Earth Science researchers GEON Cyberinfrastructure provides: –Authenticated access to data and Web services –Registration of data sets, tools, and services with metadata –Search for data, tools, and services, using ontologies –Scientific workflow environment and access to HPC –Data and map integration capability –Scientific data visualization and GIS mapping

11 Key Informatics Areas Portals –Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, … Data Integration –Search, discovery and integration of data from heterogeneous information sources (“mediation” and “semantic integration”) Use of workflow systems, and access to HPC –Ability to “program” at a higher level of abstraction –Sharing of models, along with “provenance” information –Gateways to HPC environments Management of Geospatial Information –Using GIS capabilities, map services, geospatial data integration Visualization of 3D, 4D geospatial data and information

12 Distributed System Definition A Distributed System is –one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet A Distributed Database System is –one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently

13 Distributed System Models Client – Server Client A Client B Server 1 Client C Network invocation response Process 1 Process 3 Process 2 Network Peer to Peer

14 Remote Service Invocation TCP/IP –Basic Internet protocol for computer communications –Platform for building a number of other open or proprietary, “higher-level” communications protocols Communication at a higher-level of abstraction http –Open protocol based on TCP/IP for the Web –Fixed set of “verbs” (actions) used to transfer HTML documents CORBA, Java RMI –Protocols based on an object model

15 SRB Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Sybase File Systems Unix, NT, Mac OSX User Dublin Core Resource, Mthd, User User Defined Application Meta-data Remote Proxies DataCutter Metadata Extraction C, C++, Linux I/O Unix Shell Java, NT Browsers Web Prolog Predicate MCAT SDSC Storage Resource Broker “Virtualizing” storage

16 SRB Client/Server Model SRB Client Network SRB Server Network SRB Server B SRB peer-to-peer protocol Oracle Server Oracle Client Network HPSS Client HPSS server Data are requested using an SRB ID and a “file abstraction” (open, close, read, write)

17 OpenDAP Client/Server model OpenDAP Clients Network OpenDAP Servers

18 OpenDAP (From: Peter Cornillon & Jim Gallagher) [Figure: data held in formats such as netCDF, HDF4, Matlab, JDBC/SQL, FreeForm, FITS, CDF, CEDAR, DSP, JGOFS, CODAR, tables and flat binary is exposed through format-specific OpenDAP servers, plus a general ESML server. Clients include netCDF C, netCDF Java, IDV, Ferret, GrADS, VisAD, ncBrowse, Matlab, Excel, IDL and Access.]

19 OpenDAP Data Request Data are requested with a URL made up of: protocol, machine name, OPeNDAP server, directory, file name, and an optional constraint, e.g. ?sst[10:10][0:90][0:180]. The user can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL.
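As an illustration of the URL structure described on this slide, here is a minimal Python sketch that assembles a request URL from its parts and appends a constraint expression. The function name `opendap_url`, the host `example.org`, and the server path, directory and file name are hypothetical placeholders, not values from the presentation; only the constraint `sst[10:10][0:90][0:180]` comes from the slide.

```python
def opendap_url(protocol, host, server_path, directory, filename,
                variable=None, slices=()):
    """Assemble an OPeNDAP-style request URL with an optional constraint."""
    base = f"{protocol}://{host}/{server_path}/{directory}/{filename}"
    if variable is None:
        return base  # no constraint: request the whole data set
    # Each (start, stop) pair becomes one bracketed index range, e.g. [0:90].
    constraint = variable + "".join(f"[{start}:{stop}]" for start, stop in slices)
    return f"{base}?{constraint}"


# Hypothetical host and paths; the constraint matches the slide's example.
url = opendap_url("http", "example.org", "cgi-bin/nph-dods", "data", "sst.nc",
                  variable="sst", slices=[(10, 10), (0, 90), (0, 180)])
print(url)
```

Keeping the constraint separate from the base URL makes it easy to request different subsets of the same data set.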

20 Remote Service Invocation with Web Services A Web Service is a simple protocol for invoking remote services on the Web. It is: –A network “endpoint”, i.e. server, that implements one or more “ports”. Each port is defined by the message types that it accepts and the messages it returns. –Specified by a “Web Services Description Language” (WSDL) XML document. Given the WSDL for a web service you know all you need to interact with it. Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow. –It is the basis for MS.NET, IBM Websphere, SUN, Oracle, BEA, HP, … –It is the basis for the new Grid standards like WSRF and OGSA.

21 Web Site vs Web Service From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004 Web Site –Designed to pass http get/post/put requests between a browser and a web server. –Google has a web site. Web Service –Designed for services to talk to other services by exchanging XML messages –Google also provides a web service so Google may be used in distributed apps Client’s Browser Web Server Web Server Web Service Web Service Web Service Web Service Web Service Web Service

22 Grid Services From: “ Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004 Grid: A distributed, heterogeneous set of resources –Integrated by a pervasive layer of services –Goal: allow users to view it as a single system More than the Internet (which forms part of the resource layer) Builds on the Web by building on web services Security Data Management Service Data Management Service Accounting Service Accounting Service Logging Event Service Policy Administration & Monitoring Administration & Monitoring Grid Orchestration Registries and Name binding Registries and Name binding Reservations And Scheduling Reservations And Scheduling Open Grid Service Architecture Layer Web Services Resource Framework – Web Services Notification Physical Resource Layer

23 Access Interfaces and Levels of Access Web service, native application program interface, ODBC/JDBC, filesystem filesystem DBMS Web Server “stack” SOAP server stack Application Program Mount remote filesystems Expose ODBC/JDBC interface (and full SQL) URLs and http WSDL and SOAP Application can also be “wrapped” as a Web Service SRB, OpenDAP, etc…

24 Authentication Client – Server models Client A Server 1 Network User Client-side authentication Server-side authentication Server 2 Server 3 ? ?

25 Common Authentication Certificate Authority Client Obtain Credentials Server 1 Invoke with Credentials Verify Credentials Server 2Server 3

26 Portal server 2 Grid Account Management Architecture (GAMA): Single sign-on in GEON (also used in a number of other projects) Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra Portal server 1 GAMA server CACL MyproxyCAS OGSA Grid services wrapper … Servlet container import user retrieve credential Stand-alone applications retrieve credential DB gridportlets Java keystore gama GridSphere Servlet container create user

27 Systems Issues Load Balancing, Failover, Replication Client Server 1 Server 2 Server 3 Multiple servers for load balancing, failover Data replication
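The combination of load balancing and failover on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: servers are modelled as plain callables, rotation is simple round-robin, and the names `ServerPool`, `healthy` and `down` are assumptions made for the example, not part of any system mentioned in the presentation.

```python
import itertools


class ServerPool:
    """Round-robin load balancing with failover across replica servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = itertools.cycle(range(len(self.servers)))

    def call(self, request):
        # Try each server at most once, starting from the next in rotation.
        errors = []
        for _ in range(len(self.servers)):
            server = self.servers[next(self._next)]
            try:
                return server(request)
            except ConnectionError as exc:
                errors.append(exc)  # failover: move on to the next replica
        raise ConnectionError(f"all {len(self.servers)} servers failed: {errors}")


def healthy(name):
    return lambda request: f"{name} handled {request}"


def down(request):
    raise ConnectionError("server 2 is down")


pool = ServerPool([healthy("server1"), down, healthy("server3")])
results = [pool.call(f"req{i}") for i in range(4)]
print(results)
```

Requests that land on the failed server are transparently retried on the next replica, which only works if the replicas hold the same data, hence the slide's pairing of failover with data replication.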

28 Distributed Data Access What is the issue? Ability to access data stored in multiple, different databases using a single request, e.g. –Get geologic information from multiple geologic databases –Get employee information from all branches Ability to update data stored in multiple databases, e.g. –Transfer salary amount from University to my bank account –Transfer funds from Visa account to vendor’s account

29 Distributed data access Client Database 1Database 2Database 3 Homogeneous: mySQL mySQL mySQL Heterogeneous: mySQL Oracle DB2 How about creating a “cached” local copy? mySQLExcelASCII flat file Sources may be data repositories or metadata catalogs

30 Data Warehousing Client Data Source 1 Data Source 2Data Source 3 Data Warehouse (common schema) ETL – Extract – Transform – Load ETL 1. Load data from sources to warehouse 2. Query processing interaction only between client and warehouse But, warehouse data could be “stale”, i.e. out of synch with source data…
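The two warehousing steps, and the staleness problem, can be sketched with Python's built-in sqlite3 module. Everything here is illustrative: the two sources, their field names (`id`, `rock`, `age_ma`) and the sample rows are invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources with different local schemas.
source1 = [{"SampleID": "S1-001", "Rock_Type": "Granite", "Age": 150}]
source2 = [{"id": "S2-007", "rock": "basalt", "age_ma": "65"}]


def extract_transform(rows, mapping):
    """Rename source fields to the warehouse schema and normalize types."""
    for row in rows:
        yield (str(row[mapping["SampleID"]]),
               str(row[mapping["Rock_Type"]]).capitalize(),
               int(row[mapping["Age"]]))


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE samples (SampleID TEXT, Rock_Type TEXT, Age INTEGER)")


def load(rows):
    warehouse.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)


# Step 1: ETL from each source into the common warehouse schema.
load(extract_transform(source1, {"SampleID": "SampleID",
                                 "Rock_Type": "Rock_Type", "Age": "Age"}))
load(extract_transform(source2, {"SampleID": "id",
                                 "Rock_Type": "rock", "Age": "age_ma"}))

# Step 2: the client queries only the warehouse.
result = warehouse.execute(
    "SELECT SampleID, Rock_Type, Age FROM samples ORDER BY Age").fetchall()
print(result)

# Staleness: a later change at a source is invisible until the next ETL run.
source1[0]["Age"] = 152
stale = warehouse.execute(
    "SELECT Age FROM samples WHERE SampleID = 'S1-001'").fetchone()[0]
print(stale)  # still the value loaded earlier
```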

31 Data integration via middleware Client Database 1 Database 2Database 3 Data integration Middleware (aka Mediator) 1. Each client request goes to sources, via middleware 2. Result collected by middleware and returned to client
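In contrast to warehousing, a mediator forwards each request to the sources at query time. The following Python sketch shows only that fan-out and collection pattern; the `Mediator` class, the `make_source` helper and the sample rows are hypothetical, and each "database" is just an in-memory list.

```python
def make_source(rows):
    """Return a query function standing in for one wrapped database."""
    def query(rock_type):
        return [row for row in rows if row["Rock_Type"] == rock_type]
    return query


class Mediator:
    def __init__(self, sources):
        self.sources = sources  # source name -> query function

    def query(self, rock_type):
        # 1. Each client request goes to every source via the mediator.
        # 2. The mediator collects the partial results and returns them.
        results = []
        for name, source in self.sources.items():
            for row in source(rock_type):
                results.append({**row, "source": name})
        return results


mediator = Mediator({
    "db1": make_source([{"SampleID": "A", "Rock_Type": "Granite"}]),
    "db2": make_source([{"SampleID": "B", "Rock_Type": "Basalt"}]),
    "db3": make_source([{"SampleID": "C", "Rock_Type": "Granite"}]),
})
granites = mediator.query("Granite")
print(granites)
```

Because nothing is copied ahead of time, results always reflect the current state of the sources, at the cost of contacting every source on every request.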

32 Warehousing vs Mediation Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema Mediation: Modify user query to match schemas exported by each source –But, which schema does the user query? –The Integrated View Schema –Sources “export” a view (the export schema) Federated databases –Local sources belong to different “administrative domains”, i.e. different owners. –Local autonomy

33 The Canonical Mediator / Wrapper Architecture [Diagram: a client application sends query Q1 to the mediator, which holds the integrated view in the mediator data model (e.g. relational, XML) and may hold cached data. The mediator sends sub-queries Q11, Q12, Q13, Q14 to wrappers. Each wrapper presents an export view in the mediator data model over a local view in the local data model, and issues a local query (e.g. q14) against its data source. Data sources 1 to 4 each have a local schema.] Wrapper processes could execute at sources, at mediator, or elsewhere

34 Example: A Relational Mediator Client Application Mediator (Relational data model) Wrapper Relational DBMS e.g. PostGIS Shape file

35 Example: A Shape-file Based Mediator Client Application Mediator (Shape file-based data model) Wrapper Relational DBMS e.g. PostGIS Shape file

36 Example: An XML Mediator User / Applications Mediator (XML-based data model, e.g. GML) Wrapper Relational DBMS e.g. PostGIS Shape file Wrapper XML file e.g. ArcXML

37 User Authentication and Access Control Client Application Mediator Wrapper Data source 1 Data source 2 2. User connects to mediator (passes credentials to mediator) 1. User authenticates to system 3.Mediator connects to sources a)Using original user credentials b)Or, mapped credentials (role-based access) 4. Need to define users or roles in sources How about using GAMA for authentication?

38 Different types of heterogeneity in data integration Platform heterogeneity: different OS platforms DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2 Data type heterogeneity Schema heterogeneity Heterogeneity in units, accuracy, resolution Semantic heterogeneity

39 Schema Integration: A long standing Computer Science problem Simple case –Mediator View: (SampleID varchar, Rock_Type varchar, Age int) –In Source 2 Table, map Age to int Wrapper: convert between int and varchar for Age Source 1 Table: (Sample ID varchar, Rock type varchar, Age int, …) Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar, …)
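A wrapper for the simple case might look like the Python sketch below, which casts Source 2's varchar Age to the mediator's int type and back. The function names and the sample row are hypothetical; real wrappers would also need a policy for values that are not valid integers.

```python
def export_source2_row(row):
    """Wrapper for Source 2: cast the varchar Age to the mediator's int type."""
    return {
        "SampleID": row["SampleID"],
        "Rock_Type": row["Rock_Type"],
        "Age": int(row["Age"].strip()),  # raises ValueError on non-numeric text
    }


def to_source2_value(age):
    """Reverse direction: render a mediator-side int as Source 2's varchar."""
    return str(age)


exported = export_source2_row(
    {"SampleID": "S2-001", "Rock_Type": "Granite", "Age": " 150 "})
print(exported)
```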

40 Another integration scenario –Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar) –In Source 2 Table, parse Age to obtain sub-components of the field Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar), e.g. Age = “Phanerozoic/mesozoic;jur”
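Parsing the composite Age field could be sketched as follows in Python. The separators `/` and `;` and the abbreviation `jur` are taken from the slide's single example value; the abbreviation table, the function name `parse_age`, and the assumption that every value has exactly three parts are illustrative guesses, since the presentation does not specify Source 2's format.

```python
import re

# Hypothetical abbreviation table; only "jur" appears in the slide's example.
PERIOD_ABBREVIATIONS = {
    "tri": "Triassic",
    "jur": "Jurassic",
    "cret": "Cretaceous",
}


def parse_age(age):
    """Split a composite age string into Eon, Era and Period components."""
    parts = [part.strip() for part in re.split(r"[/;]", age)]
    if len(parts) != 3:
        raise ValueError(f"expected eon/era;period, got {age!r}")
    eon, era, period = parts
    # Expand known abbreviations; otherwise just normalize capitalization.
    period = PERIOD_ABBREVIATIONS.get(period.lower(), period.capitalize())
    return {"Eon": eon.capitalize(), "Era": era.capitalize(), "Period": period}


parsed = parse_age("Phanerozoic/mesozoic;jur")
print(parsed)
```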

41 A more advanced integration scenario Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar) –Same as Source 1 table schema Query: Get rock types for all rocks from the Jurassic period Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic Source 2 Table: (Sample ID varchar, Rock type varchar, Age int), e.g. Age = 150

42 Doing the integration Query sent to mediator: SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic' Query to Source 1: SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic' For Source 2, need to map Period='Jurassic' to Age values Source 2 Table: (Sample ID varchar, Rock type varchar, Age int) Geologic_Time Table: (Eon varchar, Era varchar, Period varchar, Min int, Max int)

43 Query “fragment” sent to Source 2 SELECT DISTINCT (S2.Rock_Type) FROM Source2_Table S2, Geologic_Time_Table GT WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max) Where is the Geologic_Time table stored?
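To see the Source 2 query fragment run, here is a small sketch using Python's sqlite3 module. It assumes, purely for illustration, that the Geologic_Time table lives in the same database as Source 2's table, which is one possible answer to the question on the slide. The sample rows are invented, and the period boundaries are approximate, rounded values in millions of years.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
CREATE TABLE Geologic_Time_Table
    (Eon TEXT, Era TEXT, Period TEXT, Min INTEGER, Max INTEGER);

-- Invented sample rows.
INSERT INTO Source2_Table VALUES
    ('A', 'Granite', 150), ('B', 'Basalt', 100),
    ('C', 'Granite', 180), ('D', 'Shale', 230);

-- Approximate, rounded period boundaries in millions of years.
INSERT INTO Geologic_Time_Table VALUES
    ('Phanerozoic', 'Mesozoic', 'Triassic', 201, 252),
    ('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201),
    ('Phanerozoic', 'Mesozoic', 'Cretaceous', 66, 145);
""")

# The slide's query fragment, with the period passed as a parameter.
rows = db.execute("""
    SELECT DISTINCT S2.Rock_Type
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = ?
      AND S2.Age >= GT.Min
      AND S2.Age <= GT.Max
    ORDER BY S2.Rock_Type
""", ("Jurassic",)).fetchall()

jurassic_rock_types = [row[0] for row in rows]
print(jurassic_rock_types)
```

If the Geologic_Time table were instead kept at the mediator, the mediator would first look up the Min and Max values and then send Source 2 a plain range predicate on Age.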

44 Data Integration Carts™ Integrating data sets without explicitly creating views An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region –Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region” Need gazetteer / spatial ontology to determine Rocky Mountain region Need to know classification of datasets (as gravity and geology) Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region –Plot gravity point data that fall within polygons of rocks of given type
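The dataset-discovery step (classification plus extent intersection) can be sketched in Python as below. This is not GEONsearch's API: the `BBox` type, the `find_datasets` function, the catalog entries and all coordinates, including the box used for the Rocky Mountain region, are invented for illustration.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True if two bounding boxes overlap (touching edges count)."""
    return (a.west <= b.east and b.west <= a.east and
            a.south <= b.north and b.south <= a.north)


def find_datasets(catalog, region, categories):
    """Select datasets of the wanted classification whose extent meets the region."""
    return [d["name"] for d in catalog
            if d["category"] in categories and intersects(d["extent"], region)]


# Hypothetical region box, as a gazetteer or spatial ontology might return it.
rocky_mountains = BBox(west=-112.0, south=35.0, east=-102.0, north=45.0)

# Hypothetical metadata catalog entries.
catalog = [
    {"name": "gravity_west", "category": "gravity",
     "extent": BBox(-115.0, 36.0, -105.0, 42.0)},
    {"name": "geology_east", "category": "geology",
     "extent": BBox(-90.0, 30.0, -80.0, 40.0)},
    {"name": "geology_co", "category": "geology",
     "extent": BBox(-109.0, 37.0, -102.0, 41.0)},
    {"name": "seismic_co", "category": "seismic",
     "extent": BBox(-110.0, 38.0, -104.0, 40.0)},
]

matches = find_datasets(catalog, rocky_mountains, {"gravity", "geology"})
print(matches)
```

Bounding-box tests only narrow the candidates; the final step on the slide, plotting gravity points inside rock-type polygons, needs a true point-in-polygon test on the data itself.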

45 Ad hoc integration GEONsearch Plot map Map Data Integration Cart ™ Query Search Metadata Catalog “Geologic and gravity data in Rocky Mountains”

46 Data Registration Igneous GraniteQuartzmonzonite Rock Classification Ontology Gravity dataset (X, Y) Metadata Geologic dataset Lat, Long, RockType Metadata Item Detail Registration Item Registration (Schema registration) Location LatitudeLongitude Spatial OntologyPoint Polygon

47

48 Another complex query Query: Get rock types for all rocks from the Mesozoic era –Easy to do for Source 1: Era = “Mesozoic” –For Source 2: Need to find numeric age range for Mesozoic –Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic) –Select all Source 2 Table records whose age falls within the Mesozoic age range
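One way to derive the Mesozoic age range from its subclasses is to walk a small ontology, as in this Python sketch. The dictionaries, function names and sample rows are hypothetical, and the period boundaries are approximate, rounded values in millions of years.

```python
# Hypothetical ontology fragment: an era and its subclass periods.
ONTOLOGY = {"Mesozoic": ["Triassic", "Jurassic", "Cretaceous"]}

# Approximate, rounded (min, max) ages in millions of years.
PERIOD_RANGES = {
    "Triassic": (201, 252),
    "Jurassic": (145, 201),
    "Cretaceous": (66, 145),
}


def age_range(term):
    """Return the (min, max) age of a term, combining its subclasses if needed."""
    if term in PERIOD_RANGES:
        return PERIOD_RANGES[term]
    ranges = [age_range(child) for child in ONTOLOGY[term]]
    return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)


def rock_types_in(term, rows):
    """Rock types of all Source 2 rows whose Age falls within the term's range."""
    lo, hi = age_range(term)
    return sorted({row["Rock_Type"] for row in rows if lo <= row["Age"] <= hi})


# Invented Source 2 rows.
source2_rows = [
    {"SampleID": "A", "Rock_Type": "Granite", "Age": 150},
    {"SampleID": "B", "Rock_Type": "Basalt", "Age": 100},
    {"SampleID": "D", "Rock_Type": "Shale", "Age": 230},
    {"SampleID": "E", "Rock_Type": "Sandstone", "Age": 30},
]

mesozoic_range = age_range("Mesozoic")
mesozoic_rock_types = rock_types_in("Mesozoic", source2_rows)
print(mesozoic_range, mesozoic_rock_types)
```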