Published by Chad Stevenson. Modified over 9 years ago.
1
1 Distributed Software Systems: Cyberinfrastructure and Geoinformatics Chaitan Baru San Diego Supercomputer Center
2
2 Hardware Integrated Cyberinfrastructure System Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee Middleware Services Development Tools & Libraries Applications Geosciences Environmental Sciences Neurosciences High Energy Physics … Domain-specific Cybertools (software) Shared Cybertools (software) Distributed Resources (computation, storage, communication, etc.) Education and Training Discovery & Innovation
3
3 Community Cyberinfrastructure Projects Middleware Services Development Tools & Libraries Distributed Computing, Instruments and Data Resources Friendly Work-Facilitating Portals Authentication - Authorization - Auditing - Workflows - Visualization - Analysis Biomedical Informatics (BIRN) High Energy Physics (GriPhyN) Geosciences (GEON) Ecological Observatories (NEON) Earthquake Engineering (NEES) Ocean Observing (ORION) Hardware Adapted from: Prof. Mark Ellisman, UC San Diego Shared Tools Science Domains Your Specific Tools & User Apps.
4
4 Data, Tools, & Computation Data –Field observations –Laboratory analyses –Sensor-based data (land, airborne, satellite) Tools –QA/QC, simple transformations and analyses –Complex models Computation –Community codes –Access to high-performance computing –Data Intensive Computing
5
5 Variety of Geoinformatics Efforts Data collection –Digital data collection in the field –“When does it become cyberinfrastructure”? Database curation –E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc…. –When does it become “tools” and “community codes” Software Development –Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, … –Community codes: SCEC-CME, CIG, …
6
6 Variety of Geoinformatics Efforts High Performance Computing –LiDAR data management –Seismic analyses –Petascale initiative Data Integration –E.g. CUAHSI HIS –Also, a pressing need in projects like EarthScope
7
7 Cyberinfrastructure To provide access to all of these “resources” and support “interoperability” among them Cyberinfrastructure: The Common Platform Across Distributed Projects Data Collection Data Management And Curation Tool Development Modeling and Integration
8
8 Example: USArray Data Flow Deploy field sensor arrays –Across US Collect data from sensor arrays and perform QA/QC –One of the sites is SIO, San Diego Archive data for community access –IRIS, Seattle EarthScope/USArray: Single project, multiple participants.
9
9 D. Harding, NASA Point Cloud x, y, z, … Example: LiDAR Workflow Courtesy: Chris Crosby, ASU Survey Analyze / “Do Science” Interpolate / Grid Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, …
10
10 GEON Cyberinfrastructure Funded by NSF IT Research program Multi-institution collaboration between IT and Earth Science researchers GEON Cyberinfrastructure provides: –Authenticated access to data and Web services –Registration of data sets, tools, and services with metadata –Search for data, tools, and services, using ontologies –Scientific workflow environment and access to HPC –Data and map integration capability –Scientific data visualization and GIS mapping
11
11 Key Informatics Areas Portals –Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, … Data Integration –Search, discovery and integration of data from heterogeneous information sources (“mediation” and “semantic integration”) Use of workflow systems, and access to HPC –Ability to “program” at a higher level of abstraction –Sharing of models, along with “provenance” information –Gateways to HPC environments Management of Geospatial Information –Using GIS capabilities, map services, geospatial data integration Visualization of 3D, 4D geospatial data and information
12
12 Distributed System Definition A Distributed System is –one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet A Distributed Database System is –one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently
13
13 Distributed System Models Client – Server Client A Client B Server 1 Client C Network invocation response Process 1 Process 3 Process 2 Network Peer to Peer
14
14 Remote Service Invocation TCP/IP –Basic Internet protocol for computer communications –Platform for building a number of other open or proprietary, “higher-level” communications protocols Communication at a higher level of abstraction HTTP –Open protocol based on TCP/IP for the Web –Fixed set of “verbs” (actions) used to transfer HTML documents CORBA, Java RMI –Protocols based on an object model
15
15 SRB Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Sybase File Systems Unix, NT, Mac OSX User Dublin Core Resource, Mthd, User User Defined Application Meta-data Remote Proxies DataCutter Metadata Extraction C, C++, Linux I/O Unix Shell Java, NT Browsers Web Prolog Predicate MCAT SDSC Storage Resource Broker “Virtualizing” storage http://www.sdsc.edu/srb
16
16 SRB Client/Server Model SRB Client Network SRB Server Network SRB Server B SRB peer-to-peer protocol Oracle Server Oracle Client Network HPSS Client HPSS server Data are requested using an SRB ID and a “file abstraction” (open, close, read, write)
17
17 OpenDAP Client/Server model OpenDAP Clients Network OpenDAP Servers
18
18 OpenDAP From: Peter Cornillon & Jim Gallagher http://www.opendap.org/support/stennis_tutorial.html [Figure: OpenDAP servers sit between data stores and clients] Data: netCDF, HDF4, Matlab, DSP, JGOFS, Tables, SQL, FITS, CDF, Flat Binary, CEDAR, CODAR Servers: Matlab, HDF4, JDBC, FreeForm, FITS, CDF, CEDAR, netCDF, DSP, JGOFS, ESML, General, CODAR Clients: netCDF C, netCDF Java, IDV, Ferret, GrADS, VisAD, ncBrowse, Matlab, Excel, IDL, Access, Matlab Client, IDL Client
19
19 OpenDAP Data Request Data are requested with a URL. http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst URL parts, in order: Protocol, Machine name, OPeNDAP server, Directory, File name Constraint: ?sst[10:10][0:90][0:180] User can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL
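The URL-plus-constraint pattern on this slide can be sketched in a few lines of Python. This is only an illustration of how the constraint expression is appended: the helper name `opendap_url` is hypothetical, and the dataset URL is copied from the slide as shown, so it may not be a complete or live address.

```python
def opendap_url(base_url, variable, ranges):
    """Append a constraint expression such as ?sst[10:10][0:90][0:180].

    `ranges` holds one (start, stop) index pair per dimension of `variable`.
    """
    hyperslab = "".join(f"[{start}:{stop}]" for start, stop in ranges)
    return f"{base_url}?{variable}{hyperslab}"


# Dataset URL as it appears on the slide (assumed, not verified).
base = "http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst"
url = opendap_url(base, "sst", [(10, 10), (0, 90), (0, 180)])
print(url)
```

Because the request is just a URL, any HTTP-capable client can ask for a subset of the variable without knowing how the server stores the data.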
20
20 Remote Service Invocation with Web Services A Web Service is a simple protocol for invoking remote services on the Web. It is: –A network “endpoint”, i.e. server, that implements one or more “ports”. Each port is defined by the message types that it accepts and the messages it returns. –Specified by a “Web Services Description Language” (WSDL) XML document. Given the WSDL for a web service you know all you need to interact with it. Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow. –It is the basis for MS.NET, IBM Websphere, SUN, Oracle, BEA, HP, … –It is the basis for the new Grid standards like WSRF and OGSA.
21
21 Web Site vs Web Service From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004 Web Site –Designed to pass HTTP GET/POST/PUT requests between a browser and a web server. –Google has a web site. Web Service –Designed for services to talk to other services by exchanging XML messages –Google also provides a web service so Google may be used in distributed apps Client’s Browser Web Server Web Server Web Service Web Service Web Service Web Service Web Service Web Service
22
22 Grid Services From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004 Grid: A distributed, heterogeneous set of resources –Integrated by a pervasive layer of services –Goal: allow users to view it as a single system More than the Internet (which forms part of the resource layer) Builds on the Web by building on web services Security Data Management Service Accounting Service Logging Event Service Policy Administration & Monitoring Grid Orchestration Registries and Name binding Reservations And Scheduling Open Grid Service Architecture Layer Web Services Resource Framework – Web Services Notification Physical Resource Layer
23
23 Access Interfaces and Levels of Access Web service, native application program interface, ODBC/JDBC, filesystem filesystem DBMS Web Server “stack” SOAP server stack Application Program Mount remote filesystems Expose ODBC/JDBC interface (and full SQL) URLs and http WSDL and SOAP Application can also be “wrapped” as a Web Service SRB, OpenDAP, etc…
24
24 Authentication Client – Server models Client A Server 1 Network User Client-side authentication Server-side authentication Server 2 Server 3 ? ?
25
25 Common Authentication Certificate Authority Client Obtain Credentials Server 1 Invoke with Credentials Verify Credentials Server 2 Server 3
26
26 Portal server 2 Grid Account Management Architecture (GAMA): Single sign-on in GEON (also used in a number of other projects) Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra Portal server 1 GAMA server CACL MyProxy CAS OGSA Grid services wrapper … Servlet container import user retrieve credential Stand-alone applications retrieve credential DB gridportlets Java keystore gama GridSphere Servlet container create user
27
27 Systems Issues Load Balancing, Failover, Replication Client Server 1 Server 2 Server 3 Multiple servers for load balancing, failover Data replication
28
28 Distributed Data Access What is the issue? Ability to access data stored in multiple, different databases using a single request, e.g. –Get geologic information from multiple geologic databases –Get employee information from all branches Ability to update data stored in multiple databases, e.g. –Transfer salary amount from University to my bank account –Transfer funds from Visa account to vendor’s account
29
29 Distributed data access Client Database 1 Database 2 Database 3 Homogeneous: mySQL mySQL mySQL Heterogeneous: mySQL Oracle DB2 How about creating a “cached” local copy? mySQL Excel ASCII flat file Sources may be data repositories or metadata catalogs
30
30 Data Warehousing Client Data Source 1 Data Source 2 Data Source 3 Data Warehouse (common schema) ETL – Extract – Transform – Load ETL 1. Load data from sources to warehouse 2. Query processing interaction only between client and warehouse But, warehouse data could be “stale”, i.e. out of sync with source data…
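A minimal sketch of the two warehousing steps, assuming two hypothetical sources whose records use different field names. The table name `samples`, the field names, and the lower-casing rule are illustrative only; an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records; each source uses its own field names.
source_1 = [{"SampleID": "A1", "Rock_Type": "granite"}]
source_2 = [{"id": "B7", "rock": "Basalt"}]


def transform(record, id_field, rock_field):
    """Map one source record onto the common warehouse schema."""
    return (record[id_field], record[rock_field].lower())


# Step 1: extract from each source, transform, and load into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE samples (sample_id TEXT, rock_type TEXT)")
rows = [transform(r, "SampleID", "Rock_Type") for r in source_1]
rows += [transform(r, "id", "rock") for r in source_2]
warehouse.executemany("INSERT INTO samples VALUES (?, ?)", rows)
warehouse.commit()

# Step 2: the client queries only the warehouse, never the sources.
result = warehouse.execute(
    "SELECT sample_id, rock_type FROM samples ORDER BY sample_id"
).fetchall()
print(result)
```

Any change made at a source after the load step is invisible to the client until the ETL runs again, which is the staleness problem the slide points out.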
31
31 Data integration via middleware Client Database 1 Database 2 Database 3 Data integration Middleware (aka Mediator) 1. Each client request goes to sources, via middleware 2. Result collected by middleware and returned to client
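The request flow through the middleware can be sketched as follows. The `Wrapper` and `Mediator` classes, the field mappings, and the in-memory rows are all hypothetical stand-ins for real databases; the point is only that every request fans out to the sources at query time.

```python
class Wrapper:
    """Exposes one source's rows using the mediator's field names."""

    def __init__(self, rows, field_map):
        self.rows = rows
        self.field_map = field_map  # mediator field name -> local field name

    def query(self, predicate):
        exported = (
            {name: row[local] for name, local in self.field_map.items()}
            for row in self.rows
        )
        return [r for r in exported if predicate(r)]


class Mediator:
    """Sends each request to every source and collects the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(predicate))
        return results


db1 = Wrapper([{"SampleID": "A1", "Rock_Type": "granite"}],
              {"SampleID": "SampleID", "Rock_Type": "Rock_Type"})
db2 = Wrapper([{"id": "B7", "rock": "basalt"}, {"id": "B8", "rock": "granite"}],
              {"SampleID": "id", "Rock_Type": "rock"})

mediator = Mediator([db1, db2])
granites = mediator.query(lambda r: r["Rock_Type"] == "granite")
print(granites)
```

Unlike the warehouse approach, nothing is copied ahead of time, so results reflect the current state of each source at the cost of contacting every source per request.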
32
32 Warehousing vs Mediation Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema Mediation: Modify user query to match schemas exported by each source –But, which schema does the user query? –The Integrated View Schema –Sources “export” a view (the export schema) Federated databases –Local sources belong to different “administrative domains”, i.e. different owners. –Local autonomy
33
33 The Canonical Mediator / Wrapper Architecture Client Application Wrapper Mediator (Integrated view in mediator data model, e.g. relational, XML) Local view in local data model Export view in mediator data model Q1 Q11 Q12 Q13 Q14 Cached data Wrapper processes could execute at sources, at mediator, or elsewhere q14 Data source 1 Local schema Data source 2 Local schema Data source 3 Local schema Data source 4 Local schema
34
34 Example: A Relational Mediator Client Application Mediator (Relational data model) Wrapper Relational DBMS e.g. PostGIS Shape file
35
35 Example: A Shape-file Based Mediator Client Application Mediator (Shape file-based data model) Wrapper Relational DBMS e.g. PostGIS Shape file
36
36 Example: An XML Mediator User / Applications Mediator (XML-based data model, e.g. GML) Wrapper Relational DBMS e.g. PostGIS Shape file Wrapper XML file e.g. ArcXML
37
37 User Authentication and Access Control Client Application Mediator Wrapper Data source 1 Data source 2 1. User authenticates to system 2. User connects to mediator (passes credentials to mediator) 3. Mediator connects to sources a) Using original user credentials b) Or, mapped credentials (role-based access) 4. Need to define users or roles in sources How about using GAMA for authentication?
38
38 Different types of heterogeneity in data integration Platform heterogeneity: different OS platforms DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2 Data type heterogeneity Schema heterogeneity Heterogeneity in units, accuracy, resolution Semantic heterogeneity
39
39 Schema Integration: A long standing Computer Science problem Simple case –Mediator View: (SampleID varchar, Rock_Type varchar, Age int) –In Source 2 Table, map Age to int –Wrapper: convert between int and varchar for Age Source 1 Table: (Sample ID varchar, Rock type varchar, Age int, …) Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar, …)
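For the simple case, the wrapper's job is a type conversion. The sketch below assumes Source 2 rows arrive as dictionaries keyed by the column labels on the slide; the function name and the choice to return `None` for non-numeric text are assumptions, not something the slides specify.

```python
def wrap_source2_row(row):
    """Export a Source 2 row in the mediator view, casting Age from varchar to int."""
    age_text = row["Age"].strip()
    try:
        age = int(age_text)
    except ValueError:
        age = None  # assumption: non-numeric ages are exported as missing
    return {
        "SampleID": row["Sample ID"],
        "Rock_Type": row["Rock type"],
        "Age": age,
    }


exported = wrap_source2_row({"Sample ID": "S2-1", "Rock type": "granite", "Age": " 150 "})
print(exported)
```

The same wrapper would apply the reverse conversion (int to varchar) when a mediator query with a condition on Age is pushed down to Source 2.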
40
40 Another integration scenario –Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar) –In Source 2 Table, parse Age to obtain sub-components of the field Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar), e.g. Age = “Phanerozoic/mesozoic;jur”
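A sketch of the parsing step for the example value “Phanerozoic/mesozoic;jur”. The delimiter convention ("/" then ";"), the abbreviation table, and the function name are assumptions inferred from that single example; a real source would need its own rules.

```python
import re

# Hypothetical abbreviation table; only "jur" appears in the slide's example.
PERIOD_ABBREVIATIONS = {"tri": "Triassic", "jur": "Jurassic", "cret": "Cretaceous"}


def parse_age(age):
    """Split a combined Age string into Eon, Era and Period components."""
    parts = [p.strip() for p in re.split(r"[/;]", age)]
    if len(parts) != 3:
        raise ValueError(f"expected eon/era;period, got {age!r}")
    eon, era, period = parts
    period = PERIOD_ABBREVIATIONS.get(period.lower(), period.capitalize())
    return {"Eon": eon.capitalize(), "Era": era.capitalize(), "Period": period}


parsed = parse_age("Phanerozoic/mesozoic;jur")
print(parsed)
```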
41
41 A more advanced integration scenario Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar) –Same as Source 1 table schema Query: Get rock types for all rocks from the Jurassic period Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic Source 2 Table: (Sample ID varchar, Rock type varchar, Age int), e.g. Age = 150
42
42 Doing the integration Query sent to mediator: SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic' Query to Source 1: SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic' For Source 2, need to map Period='Jurassic' to Age values Source 2 Table: (Sample ID varchar, Rock type varchar, Age int) Geologic_Time Table: (Eon varchar, Era varchar, Period varchar, Min int, Max int)
43
43 Query “fragment” sent to Source 2 SELECT DISTINCT (S2.Rock_Type) FROM Source2_Table S2, Geologic_Time_Table GT WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max) Where is the Geologic_Time table stored?
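The query fragment can be tried end to end with an in-memory SQLite database. This sketch assumes both tables live in the same database, which is only one possible answer to the slide's closing question. The sample rows are invented, ages are taken to be in millions of years, and the Jurassic bounds (145 to 201) are rounded approximations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
CREATE TABLE Geologic_Time_Table (
    Eon TEXT, Era TEXT, Period TEXT, "Min" INTEGER, "Max" INTEGER
);
-- Invented sample rows; Age is assumed to be in millions of years.
INSERT INTO Source2_Table VALUES
    ('s1', 'sandstone', 150),
    ('s2', 'limestone', 160),
    ('s3', 'granite', 300),
    ('s4', 'sandstone', 190);
-- Approximate, rounded bounds for the Jurassic.
INSERT INTO Geologic_Time_Table VALUES
    ('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201);
""")

# The period name is mapped to an age range by joining on the lookup table.
rows = conn.execute("""
    SELECT DISTINCT S2.Rock_Type
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = 'Jurassic'
      AND S2.Age >= GT."Min"
      AND S2.Age <= GT."Max"
""").fetchall()
rock_types = sorted(r[0] for r in rows)
print(rock_types)
```

If the lookup table were kept at the mediator instead, the mediator would first resolve 'Jurassic' to a numeric range and send Source 2 a plain range condition on Age.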
44
44 Data Integration Carts ™ Integrating data sets without explicitly creating views An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region –Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region” Need gazetteer / spatial ontology to determine Rocky Mountain region Need to know classification of datasets (as gravity and geology) Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region –Plot gravity point data that fall within polygons of rocks of given type
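The two spatial steps in this request, intersecting dataset extents with the region and then keeping gravity points that fall inside rock polygons, can be sketched with plain geometry. All coordinates, dataset names, and function names below are made up for illustration and are not GEON's actual interfaces.

```python
def boxes_intersect(a, b):
    """Boxes are (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical region extent and dataset extents taken from metadata.
region = (-110.0, 35.0, -104.0, 42.0)
datasets = {
    "gravity_a": (-108.0, 36.0, -105.0, 40.0),
    "gravity_b": (-90.0, 30.0, -85.0, 33.0),
}
in_region = sorted(name for name, box in datasets.items() if boxes_intersect(box, region))

# Hypothetical rock polygon and gravity points.
granite_polygon = [(-107.0, 37.0), (-105.0, 37.0), (-105.0, 39.0), (-107.0, 39.0)]
gravity_points = [(-106.0, 38.0), (-104.5, 38.0)]
to_plot = [p for p in gravity_points if point_in_polygon(p[0], p[1], granite_polygon)]
print(in_region, to_plot)
```

The cheap bounding-box test on metadata narrows the candidate datasets before the more expensive point-in-polygon test touches any actual data.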
45
45 Ad hoc integration GEONsearch Plot map Map Data Integration Cart ™ Query Search Metadata Catalog “Geologic and gravity data in Rocky Mountains”
46
46 Data Registration Igneous Granite Quartz monzonite Rock Classification Ontology Gravity dataset (X, Y) Metadata Geologic dataset Lat, Long, RockType Metadata Item Detail Registration Item Registration (Schema registration) Location Latitude Longitude Spatial Ontology Point Polygon
47
47
48
48 Another complex query Query: Get rock types for all rocks from the Mesozoic era –Easy to do for Source 1: Era = “Mesozoic” –For Source 2: Need to find numeric age range for Mesozoic –Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic) Select all Source 2 Table records whose age range falls within the Mesozoic age range
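Finding the era's age range from its subclasses amounts to taking the smallest minimum and largest maximum over its periods. In the sketch below the lookup rows use rounded, approximate ages in millions of years, and the Source 2 records are invented for illustration.

```python
# (Era, Period, Min, Max): approximate ages in millions of years, rounded.
GEOLOGIC_TIME = [
    ("Mesozoic", "Triassic", 201, 252),
    ("Mesozoic", "Jurassic", 145, 201),
    ("Mesozoic", "Cretaceous", 66, 145),
    ("Paleozoic", "Permian", 252, 299),
]


def era_age_range(era):
    """Combine the ranges of all periods that belong to the given era."""
    periods = [row for row in GEOLOGIC_TIME if row[0] == era]
    if not periods:
        raise KeyError(era)
    return min(p[2] for p in periods), max(p[3] for p in periods)


# Invented Source 2 records: (SampleID, Rock_Type, Age).
source2 = [("s1", "sandstone", 150), ("s2", "granite", 300), ("s3", "shale", 70)]

low, high = era_age_range("Mesozoic")
mesozoic_rock_types = sorted({rock for _, rock, age in source2 if low <= age <= high})
print((low, high), mesozoic_rock_types)
```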