Web Services and Integration/Mediation Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems March 4, 2008
2 Today Reminder HW2 Milestone 2 due Thursday Distributed programming, concluded: RPC and Web Services Then onward: How we can translate between structured data formats Mediators and information integration
Some Common Modes of Building Distributed Applications Data-intensive: XQuery (fetch XML from multiple sites, produce new XML) Turing-complete functional programming language Good for Web Services; not much support for I/O, etc. MapReduce (built over DHT or distributed file system) Single filter (map), followed by single aggregation (reduce) Languages over it: Sawzall, Pig Latin, Dryad, … Message passing / request-response: e.g., over a DHT, sockets, or message queue Communication via asynchronous messages Processing in message handler loop Function calls: Remote procedure call / remote method invocation 3
4 How RPC Generally Works You write an application with a series of functions One of these functions, F, will be distributed remotely You call a “stub generator” A caller stub emulates the function F: Opens a connection to the server Requests F, marshalling all parameters Receives F’s return status and parameters A server stub emulates the caller: Receives a request for F with parameters Unmarshals the parameters, invokes F Takes F’s return status (e.g., protection fault), return value, and marshals it back to the client
5 Passing Value Parameters Steps involved in doing remote computation through RPC 2-8
6 RPC Components Generally, you need to write: Your function, in a compatible language An interface definition, analogous to a C header file, so other people can program for F without having its source Generally, software will take the interface definition and generate the appropriate stubs (In the case of Java, RMIC knows enough about Java to run directly on the source file) The server stubs will generally run in some type of daemon process on the server Each function will need a globally unique name or GUID
7 Parameter Passing Can Be Tricky Because of References The situation when passing an object by reference or by value. 2-18
8 What Are the Hard Problems with RPC? Esp. Inter-Language RPC? Resolving different data formats between languages (e.g., Java vs. Fortran arrays) Reliability, security Finding remote procedures in the first place Extensibility/maintainability (Some of these might look familiar from when we talked about data exchange!)
9 Web Services Goal: provide an infrastructure for connecting components, building applications in a way similar to hyperlinks between data It’s another distributed computing platform for the Web Goal: Internet-scale, language-independent, upwards-compatible where possible This one is based on many familiar concepts Standard protocols: HTTP Standard marshalling formats: XML-based, XML Schemas All new data formats are XML-based
10 Three Parts to Web Services 1.“Wire” / messaging protocols Data encodings, RPC calls or document passing, etc. 2.Describing what goes on the wire Schemas for the data 3.“Service discovery” Means of finding web services
11 The Protocol Stacks of Web Services Enhanced + expanded from a figure from IBM’s “Web Services Insider”, Other extensions SOAP Attachments WS-Security WS-AtomicTransaction, WS-Coordination SOAP, XML-RPC XML XML Schema Service Description (WSDL) Service Capabilities (WS-Capability) Message Sequencing Orchestration (WS-BPEL) Inspection Directory (UDDI) Wire Format Stack Discovery Stack Description Stack WS-Addressing High-level state transition + msging diagrams between modules
12 Messaging Protocol: SOAP Simple Object Access Protocol: XML-based format for passing parameters Has a SOAP header and body inside an envelope As a defined HTTP binding (POST with content-type of application/soap+xml) A companion SOAP Attachments encapsulates other (MIME) data The header defines information about processing: encoding, signatures, etc. It’s extensible, and there’s a special attribute called mustUnderstand that is attached to elements that must be supported by the callee The body defines the actual application-defined data
13 A SOAP Envelope 12
14 Making a SOAP Call To execute a call to service PlaceOrder: POST /PlaceOrder HTTP/1.1 Host: my.server.com Content-Type: application/soap+xml; charset=“utf-8” Content-Length: nnn …
15 SOAP Return Values If successful, the SOAP response will generally be another SOAP message with the return data values, much like the request If failure, the contents of the SOAP envelop will generally be a Fault message, along the lines of: SOAP-ENV:Client Could not parse message …
16 How Do We Declare Functions? WSDL is the interface definition language for web services Defines notions of protocol bindings, ports, and services Generally describes data types using XML Schema In CORBA, this was called an IDL In Java, the interface uses the same language as the Java code
17 A WSDL Service Service Port PortType Operation PortType Operation PortType Operation Binding
18 Web Service Terminology Service: the entire Web Service Port: maps a set of port types to a transport binding (a protocol, frequently SOAP, COM, CORBA, …) Port Type: abstract grouping of operations, i.e. a class Operation: the type of operation – request/response, one-way Input message and output message; maybe also fault message Types: the XML Schema type definitions
19 Example WSDL
20 JAX-RPC: Java and Web Services To write JAX-RPC web service “endpoint”, you need two parts: An endpoint interface – this is basically like the IDL statement An implementation class – your actual code public interface BookQuote extends java.rmi.Remote { public float getBookPrice(String isbn) throws java.rmi.RemoteException; } public class BookQuote_Impl_1 implements BookQuote { public float getBookPrice(String isbn) { return 3.22; } }
21 Different Options for Calling The conventional approach is to generate a stub, as in the RPC model described earlier You can also dynamically generate the call to the remote interface, e.g., by looking up an interesting function to call Finally, the “DII” (Dynamic Instance Invocation) method allows you to assemble the SOAP call on your own
22 Creating a Java Web Service A compiler called wscompile is used to generate your WSDL file and stubs You need to start with a configuration file that says something about the service you’re building and the interfaces that you’re converting into Web Services
23 Example Configuration File
24 Starting a WAR The Web Service version of a Java JAR file is a Web Archive, WAR There’s a tool called wsdeploy that generates WAR files Generally this will automatically be called from a build tool such as Ant Finally, you may need to add the WAR file to the appropriate location in Apache Tomcat (or WebSphere, etc.) and enable it See WSPack2/jaxrpc.html for a detailed example WSPack2/jaxrpc.html
25 Finding a Web Service UDDI: Universal Description, Discovery, and Integration registry Think of it as DNS for web services It’s a replicated database, hosted by IBM, HP, SAP, MS UDDI takes SOAP requests to add and query web service interface data
26 What’s in UDDI White pages: Information about business names, contact info, Web site name, etc. Yellow pages: Types of businesses, locations, products Includes predefined taxonomies for location, industry, etc. Green pages – what we probably care the most about: How to interact with business services; business process definitions; etc Pointer to WSDL file(s) Unique ID for each service
27 Data Types in UDDI businessEntity: top-level structure describing info about the business businessService: name and description of a service bindingTemplate: how to access the service tModel (t = type/technical): unique identifier for each service-template specification publisherAssertion: describes relationship between businessEntities (e.g., department, division)
28 Relationships between UDDI Structures publisherAssertion businessEntity businessServicebindingTemplate tModel n 2 1 n 1n m n
29 Example UDDI businessEntity My Books Technical Book Wholesaler … … <!– keyedReferences to tModels …
30 UDDI in Perspective Original idea was that it would just organize itself in a way that people could find anything they wanted Today UDDI is basically a very simple catalog of services, which can be queried with standard APIs It’s not clear that it really does what people really want: they want to find services “like Y” or “that do Z”
31 The Problem: With UDDI and Plenty of Other Situations There’s no universal, unambiguous way of describing “what I mean” Relational database idea of “normalization” doesn’t convert concepts into some normal form – it just helps us cluster our concepts in meaningful ways “Knowledge representation” tries to encode definitions clearly – but even then, much is up to interpretation The best we can do: describe how things relate pollo = chicken = poulet = 雞 = 鸡 = j ī = मुर्गी = murg Note that this mapping may be imprecise or situation-specific! Calling someone a chicken, vs. a chicken that’s a bird
32 This Brings Us Back to XQuery, Whose Main Role Is to Relate XML Suppose we define an XML schema for our target data and our source data A view is a stored query Function from a set of (XML) sources to an XML output In fact, in XQuery, a view is actually called a function Can directly translate between XML schemas or structures Describes a relationship between two items Transform 2 into 6 by “add 4” operation Convert from S1 to S2 by applying the query described by view V Often, we don’t need to transfer all data – instead, we want to use the data at one source to help answer a query over another source…
33 Lazy Evaluation: A Virtual View Source2.xml Source1.xml Virtual XML doc. XQuery Query Form Browser/App Server(s) Query Results XQuery Source2.xml Source1.xml Composed XQuery HTML XSLT
34 Let’s Look at Some Simple Mappings Beginning with examples of using XQuery to convert from one schema to another, e.g., to import data First: let’s review what our XQuery mappings need to accomplish…
35 Challenges of Mapping Schemas In a perfect world, it would be easy to match up items from one schema with another Each element would have a simple correspondence to an element in the other schema Every value would clearly map to a value in the other schema Real world: as with human languages, things don’t map clearly! Different decompositions into elements Different structures Tag name vs. value Values may not exactly correspond It may be unclear whether a value is the same It’s a tough job, but often things can be mapped
36 Example Schemas Bob’s Movie Database … … … … … * * Mary’s Art List … … … … … * Want to map data from one schema to the other
37 Mapping Bob’s Movies Mary’s Art Start with the schema of the output as a template: $i $y $a $s $t Then figure out where to find the values in the source, and create XPaths
38 The Final Schema Mapping Mary’s Art Bob’s Movies for $m in doc(“movie.xml”)//movie, $a in $m/director/text(), $i in $m/title/text(), $t in $m/title/text() return $i movie $a $t Note the absence of subject… We had no reasonable source, so we are leaving it out.
39 Mapping Values Sometimes two schemas use different representations for the same thing ID SSN English Hungarian We typically use an intermediate table defining correspondences – a “concordance table” It can be generated automatically, and then corrected by hand (since there will often be exceptions)
40 An Example Value Mapping Problem Penn student enrollment DB: … Mary McDonald F03 cse Jon Doh Penn dental plan: Dental sealant Want to output student names + treatments…
41 Translating Values with a Concordance Table return { { $n } { $tr }
42 Translating Values with a Concordance Table for $p in doc (“student.xml”) /db/student, $pid in $p/pennid/text(), $n in $p/name/text(), $m in doc (“concord.xml”) /db/mapping, $f in $m/from/text(), $t in $m/to/text(), $d in doc(“dental.xml”)/db/patient, $s in $d/ssn/text(), $tr in $d/treatment/text() where ____________________ return { { $n } { $tr } student.xml: Mary McDonald F03 cse330 $pid: PennID $n: name
43 Translating Values with a Concordance Table for $p in doc (“student.xml”) /db/student, $pid in $p/pennid/text(), $n in $p/name/text(), $d in doc(“dental.xml”)/db/patient, $s in $d/ssn/text(), $tr in $d/treatment/text(), $m in doc (“concord.xml”) /db/mapping, $f in $m/from/text(), $t in $m/to/text() where ____________________ return { { $n } { $tr } student.xml: Mary McDonald F03 cse330 dental.xml: Dental sealant $pid: PennID $n: name $s: ssn $tr: treatment
44 Translating Values with a Concordance Table for $p in doc (“student.xml”) /db/student, $pid in $p/pennid/text(), $n in $p/name/text(), $d in doc(“dental.xml”)/db/patient, $s in $d/ssn/text(), $tr in $d/treatment/text(), $m in doc (“concord.xml”) /db/mapping, $f in $m/from/text(), $t in $m/to/text() where ____________________ return { { $n } { $tr } student.xml: Mary McDonald F03 cse330 dental.xml: Dental sealant concord.xml: $pid: PennID $n: name $s: ssn $tr: treatment $f: PennID $t: ssn
45 Translating Values with a Concordance Table for $p in doc (“student.xml”) /db/student, $pid in $p/pennid/text(), $n in $p/name/text(), $d in doc(“dental.xml”)/db/patient, $s in $d/ssn/text(), $tr in $d/treatment/text(), $m in doc (“concord.xml”) /db/mapping, $f in $m/from/text(), $t in $m/to/text() where ____________________ return { { $n } { $tr } student.xml: Mary McDonald F03 cse330 dental.xml: Dental sealant concord.xml: $pid: PennID $n: name $s: ssn $tr: treatment $f: PennID $t: ssn
46 Drawbacks to Point-to-Point Mappings They can get data from one source to another, but what if you want to see elements that aren’t shared? Painful to create n 2 mappings… Sometimes we don’t actually want to ship the data from one source to another, but to see both We don’t want to put Barnes & Noble’s inventory INTO Amazon’s – but we want to see books from both This leads us to a “mediator” approach…
47 Data Integration and Warehousing Create a middleware “mediator” or “data integration system” over the sources All sources are mapped to a common “mediated schema” Warehouse approach actually has a central database, and load data from the sources into it Virtual approach has just a schema – it consults sources to answer each query The mediator accepts queries over the central schema and returns all relevant answers
48 Data Integration System / Mediator Typical Data Integration Components Mediated Schema Wrapper Source Data Query-based Schema Mappings in Catalog Source Catalog QueryResults
Mediator / Virtual Integration Systems The subject of much research since the 80s and especially 90s Examples: TSIMMIS, Information Manifold, MIX, Garlic, … Original focus on Web Real-world integration companies (IBM, BEA/Oracle, Actuate, …) are focusing on the enterprise – more $$$! A common model (exemplified by TSIMMIS, Garlic): Take the source data Define a schema mapping that produces content for the mediated schema, based on the source data The data for the mediated schema is the “union” of all of the mappings 49
50 Answering Queries in TSIMMIS Based on view unfolding: composing a query and view The query is being posed over the mediated schema for $b in document(“dblp.xml”)/root/book where $b/title/text = “Distributed Systems” and $b/author/text() = “Tanenbaum” return $b Wrappers are responsible for converting data from the source into a subset of the mediated schema for $c in sql(“select author,year,title from CISbook’”) return { $c/* }
51 The Mediated Schema as a Union of Views from Wrappers Wrappers have names, some sort of output schema: define function GetCISBooks() as book* { for $c in sql(“select author,year,title from CISbook’”) return { $c/* } } This gets “unioned” with output from other results: return { { GetCISBooks() } { GetEEBooks() } } book authoryeartitle
52 How to Answer the Query Given our query: for $b in document(“dblp.xml”)/root/book where $b/title/text() = “Distributed Systems” and $b/author/text() = “Tanenbaum” return $b We want to find all wrapper definitions that output the right structure to match our query Book elements with titles and authors (and any other attributes)
53 Query Composition with Views We find all views that define book with author and title, and we compose the query with each of these In our example, we find one wrapper definition that matches: define function GetCISBooks() as book* { for $b in sql(“select author,year,title from CISbook’”) return { $b/* } } for $b in document(“mediated-schema”)/root/book where $b/title/text() = “Distributed Systems” and $b/author/text() = “Tanenbaum” return $b return { { GetCISBooks() } … }
Making It Work for $b in doc (“…”)/root/book where $b/title/text() = “Dist. Systems” and $b/author/text() = “Tanenbaum” return $b 54 book authoryeartitle root authoryeartitle $c $c/author$c/year$c/title
55 The Final Step: Unfolded View The query and the view definition are merged (the view is “unfolded”), yielding, e.g.: for $b in sql(“select author,title,year from CISbook where author=‘Tanenbaum’”) where $b/title/text() = “Distributed Systems” return $b
56 Summary: Mapping, Integrating, and Sharing Data Based on XQuery rather than XSLT “Views” (in XQuery, functions) as the bridge between schemas Joins and nesting are important in creating these views Can do point-to-point mappings to exchange data Very common approach: mediated schema or warehouse Create a central schema – may be virtual Map sources to it Pose queries over this UDDI versus this approach? What about search and its relationship to integration? In particular, search over Amazon, Google Maps, Google, Yahoo, …
57 Next Time… We’ll start looking at information retrieval, which is the basis of Web search