DB2 Information Integrator Software

Jaffa Sztejnbok, IT Specialist, Information Management Global Technology Unit

Agenda
What is Enterprise Information Integration
Without Information Integrator
Data Challenges
Complementary Information Integration Approaches
IBM DB2 Information Integration Products and Value
IBM's Information Integrator 8.1
Demo

What is Enterprise Information Integration?
Provides access to diverse, distributed, and real-time data as if it were a single source, no matter where it resides.
Helps businesses:
Shorten application development time
Improve productivity and application efficiency
Leverage existing data assets for the benefit of the business

Enterprise Information Integration is a category of middleware that allows access to a variety of distributed data as though it were in a single database, whether or not it is. It integrates structured and unstructured data and content sources to provide real-time read/write access, the ability to transform data for business analysis and data interchange, and the replication or movement of data to improve performance and availability. When you discuss distributed access, the first thing that often comes to mind is performance and availability. To address these concerns, we have data placement capabilities. Caching helps when performance and availability are at issue.

Without Data Federation

Data Challenges Variety, Velocity, and Volume
New composite applications need data from multiple sources
Consumers expect holistic, personalized, and value-added content
Relational, XML, packaged applications, content repositories, and file systems all contain critical business information
Increasing emphasis on current data: real-time analytics, business activity monitoring
Petabytes will be the measure of available online data
All client interactions are important (e.g. instant messages, audio records, Web traffic, …)
Internet and intranet content

Let's focus on Information Integration and start by discussing the data challenges: a greater variety of data, requirements for more current data, and increasing volumes of data. New composite applications need business-critical information from relational stores, XML documents, content repositories, and file systems. Consumers expect personalized views and communications, with value-added suggestions (suggested products based on types of past purchases: books, clothing, investments, insurance, bank accounts). There is an increasing emphasis on current data for analysis. This means bridging historical data (typically in the warehouse) and real-time data, which is required for customer responsiveness. Business activity monitoring is required for operational efficiency. Once an event is identified as unusual, our federation capability helps present the information needed to handle it. All client interactions are important, and data volumes increase as we capture all the different information formats: documents, visual, audio, clickstream, messages. These characteristics and challenges combine to make the case for distributed access. It is difficult to have all the required information in a single store or type of storage.

Complementary Information Integration Approaches
Consolidate data for local access
Data warehouses, operational data stores, production applications
Creating additional reference copies
Typically managed by ETL (Extract, Transform, Load) or replication technologies
Want a stable data view with the ability to control refresh intervals
Complex, long-running joins or transformations are required
Predictable and repeatable access
Need high performance for applications ("bring data to application")
High availability required; can't tolerate network outages
Need historical or trending information

Integrated access to distributed sources (distributed access)
Real-time data, e.g., stock quotes
Extending a data warehouse with real-time data
Data changes rapidly
Wide heterogeneity in the data to be accessed, in relational and non-relational formats
Data that is not practical or possible to copy, and when movement of data is small

At IBM, we believe customers' information integration requirements consist of distributed access (federation) as well as consolidated access (replication and ETL). This chart helps show when a customer would use one or the other, but often they are used together. Example: use ETL to build a warehouse, replication to keep it automatically updated on a scheduled basis, and federation to extend it for queries that require data that it didn't make sense to put in the warehouse.

EII or distributed access approaches are indicated when:
Access performance and load on source systems can be traded for an overall lower-cost implementation
Currency requirements demand a fresh copy of the data
Data is widely heterogeneous or changes rapidly
Data security, licensing restrictions, or industry regulations restrict data movement
Unique functions must be accessed at the data source
Queries return small result sets among federated systems
Large-volume data is infrequently accessed

ETL or replication approaches are indicated when:
Access performance or availability requirements demand centralized or local data
Complex transformation is required to achieve semantically consistent data
Queries are complex and multidimensional
Currency requirements demand point-in-time consistency, e.g. close of business

Information Integration combines different technologies that are complementary. Some data needs to be consolidated for local access. This is typically accomplished by ETL or replication technologies and is usually the best method for building data warehouses, operational data stores, and the data for production applications. Integrated access adds to consolidated data by federating it with data that it doesn't make sense to put in the consolidated store; examples include the need for real-time data or to join mixed-format data. Integrated access represents an emerging industry category, referred to as Enterprise Information Integration. While federation is not new, what is new is the customer interest in this space due to their data challenges.

IBM DB2 Information Integration Products
DB2 Information Integrator
SQL programming model
Leverage SQL skills and tools
Federated data server and replication server

DB2 Information Integrator for Content
Content programming model
Leverage CM skills and tools
Federated data server, text mining, and workflow engine

Data sources shown: data warehouses, relational databases, spreadsheets, Extended Search sources, and content sources (content repositories, office reports, fax), among others.

The DB2 Information Integrator family consists of two products. Both offer federated access to diverse and distributed data and content stores, but each presents a different programming model tailored to a different developer community. Here you see a representative set of data sources that these offerings access. For the most part, both offerings can access all the sources shown, with a couple of exceptions.

DB2 Information Integrator 8.1
A Federated Data Server: query distributed data as if it were a single source
Define an integrated view across diverse and distributed data: wide range of data and content sources; extensible to virtually any data source
Query as if a single source: use standard SQL queries and SQL expressions; include text semantics in the search; surface specialized functions into SQL; leverage query optimization and caching
Compose XML documents: combine diverse sources; validate against DTDs or schemas
Publish results to a message queue
Familiar DB programming model
Single-source, relational updates
Integrated SQL view over: DB2, Oracle, SQL Server, Sybase, Teradata, OLE DB, ODBC, Excel, XML, message queues, Web services, flat files, document repositories, content repositories, LDAP directories, WWW, databases, and more.

DB2 II 8.1 will have both federation and replication capabilities. Let's look first at federation. Extensive sources are available to federate. Federated capabilities include federated query, composing XML documents, and publishing to MQSeries queues. Update in this first release is limited to relational sources, and only a single source at a time.

DB2 Information Integrator 8.1
A Replication Server: manage consolidation for performance and availability
Distribute data among relational databases: DB2, Informix, Microsoft, Oracle, Sybase, Teradata
Support flexible topologies: distribution (one to many), consolidation (many to one)
Match data movement modes to usage requirements: table-at-a-time for warehouse loading during a batch window; transaction-consistent for online data
Choose latency characteristics: scheduled, interval-based, continuous
Apply transformations in-line: standard SQL expressions or stored procedure execution

Heterogeneous replication is part of DB2 II 8.1. It is supported among DB2, Informix, Oracle, Sybase, and Microsoft SQL Server, and to Teradata. Heterogeneous replication was first available in DataJoiner, and it provides for replication to one or many targets or from one or many sources. This support is continued in DB2 Information Integrator. Replication is supported at the table level or at the transaction-consistent level. Replication may take place based on a schedule, time intervals, or events, or it may be continuous. Transformations between source and target can be specified with SQL or accomplished via stored procedures.

DB2 Information Integrator for Content
Define integrated views across diverse and distributed data: IBM Content Manager portfolio and other content repositories, e.g. FileNET, Lotus databases, ODBC- and JDBC-compliant relational databases, and IBM Lotus Extended Search sources (LDAP directories, WWW, databases, …)
Search federated data: the search application uses the IBM Content Manager API
Mine additional metadata from text documents: identify document language; extract entities like names or technical terms; categorize documents based on a taxonomy; group documents based on related content; create a document synopsis
Define workflows

DB2 Information Integrator for Content is a re-branding of Enterprise Information Portal. It provides federated access to CM sources as well as other content repositories and relational sources. Content Manager customers are the primary audience for this product. The application interface and programming model for DB2 Information Integrator for Content is based on the CM object APIs. Customers who are primarily working with relational databases should look first at DB2 Information Integrator, which provides federation of relational and content sources using a SQL API.

II for Content has unique capabilities for text mining and text analysis:
Algorithms scan text documents to determine the national language in which they were written.
Key features of a document, such as proper names or technical terms that can be used for classifying it, can be extracted automatically.
Documents can be categorized into a customer-specified taxonomy.
Documents can be grouped based on their content.
Automatic summarization is available by scanning the document for summary sentences.
An integrated workflow function is available so that any data retrieved can be part of a workflow process. This workflow is based on the embedded copy of MQSeries Workflow, which has been tailored for content integration.

DB2 Information Integrator Value
Extend current investments
Work within your existing infrastructure
Consolidate data or access distributed data as if it were a single data source
Combine existing data and content assets in new ways
Use the familiar SQL programming model and existing tools
Build on a standards-based, strategic integration platform

Speed time to value for composite applications
Reduce hand-coding by 40%-65%
Reduce skill requirements
Reduce development time by half

Control costs
Reduce payroll costs
Reduce the need to rip and replace
Reduce the need to manage redundant data

We have talked with many IBM and non-IBM customers about the value of this product; here are the results of those discussions. Customers have verified these are valuable.

Speeding Application Development
Without II, the application developer's effort must handle:
Unique interfaces for each data type
Joining data from varied sources
Aggregation and grouping
Correlating data

With an RDBMS and II, II handles:
Interfaces for each data type
Joining data from varied sources
Transformation
Correlating data
Non-relational data

Special features: set processing, built-in database transformation functions, optimization, automatic local caching, data-driven triggers

SQL is an open standard
SQL is easily testable, independent of the application
JDBC, XML, Web services

Integrating data sources is so complex programmatically that either 1) you don't do it, 2) you pay the price of moving to an integrated store, which is extremely costly and may not be justifiable, or 3) you risk developing and maintaining very complex code.

Crystal Decisions
Vision: As a world-leading information infrastructure company, Crystal Decisions helps businesses make better decisions by bringing together their people and their information.
Challenge: Improve response time for complex queries over distributed heterogeneous data sources.
Solution: Provides transparent, globally optimized access to heterogeneous, distributed data. Crystal Reports accesses the distributed data as if it were a single database. Response time improvement of up to 98% seen in house.
Business Value: "Users of Crystal Reports and Crystal Enterprise, with DB2 Information Integrator, can … discover new ways to meet the information needs of their organization." Janet Wood, Vice President of Business Development, Crystal Decisions.
Competitive Value: "DB2 Information Integrator provides Crystal Reports with exceptionally fast and efficient federated querying capability." Trevor Smith, Program Manager, Business Development Group, Crystal Decisions.

Crystal Decisions is an ISV that provides query and reporting software. They understand that their clients often want reports that span information across different types of sources. They have provided that in their product, but were interested in exploring our technology to determine whether it provided a performance benefit because of our distributed optimization for heterogeneous environments. Crystal found that our federation technology improved performance by up to 98% for queries against heterogeneous and distributed data. This underscores how complex it is to do join processing efficiently. One can expect to see improved application performance using this product versus doing the joins in an application.

Without Data Federation

Federated Access to Diverse Data
This diagram captures the main points of the federation functions in DB2 Information Integrator.
It is accessible from the Web and from standard database clients (CLI, ODBC, JDBC).
It provides a unified view over a large set of data sources.
All of the popular relational sources are supported, as you see on the right.
Along the bottom are the non-relational sources. Access to these is read-only, except for WebSphere MQ messages, which can be written and read:
Life Sciences specific sources: BLAST, HMMER, Documentum, Entrez, BioRS
Text data in flat files
XML documents
Excel spreadsheets
WebSphere MQ messages
A rich set of sources is accessible through IBM Lotus Extended Search. A partial list includes Notes databases, MS Exchange, IBM Content Manager, Domino.Doc, LDAP directories, Web search engines (e.g., Yahoo, Google, CNN...), and content sources (CM).

IBM DB2 Information Integrator Software
Information Integration: integrating diverse business information across and beyond the enterprise
Data federation: extensible read/write access across diverse data and content sources; database programming model (SQL); content programming model (OO API)
Data placement: caching and replication over heterogeneous information
Data transformation: SQL, XML, Web services
Advanced search and mining
Metadata management
Part of a complete integration solution: XML publishing, consumption, and interchange; WebSphere business integration
Open platform based on industry standards

This is our vision for Information Integration. The fundamental technologies include federation, data placement, and transformation. Federation needs to service traditional clients, Web services, messaging, and workflow, and to present data in one of three programming models: SQL; the CM (Content Manager) object-oriented API, which is available with DB2 II for Content only; and an XML-based programming model (XQuery), which comes later. We are currently working with the standards bodies to define an open standard for XQuery. This enables our customers to use the rich semantics developed for their specific types of sources and to protect their existing investment in the SQL or CM object-oriented programming models. When performance or availability becomes a key concern for a specific application, our solution has caching. We also have replication as a data placement alternative to enable distributed data copies. Integration requires a robust transformation capability. We believe SQL and XML provide for extensive transformations. At IBM, we formed the Institute for Search and Text Analysis, part of our research organization, which is devoted to advancing text search and text mining and taking that research into our products. Metadata management is important for bringing together all the sources of data and understanding how the linkages happen.
XML is a key part of integration today, so we provide a comprehensive XML capability. We support the ability to both generate and consume XML documents. We have provided this in our current relational database, which lets our customers take advantage of their investment in relational skills and relational applications; the best store for XML is in a relational database: DB2. Information Integration is complementary to the WebSphere integration of people and processes, and our adherence to open standards makes us the obvious choice for information integration in any integration solution.

Data Federation
Transparency: hides differences among sources; appears to be one source; supports a high-level query language; functional compensation and pass-through
Heterogeneity: integrates data from diverse sources (relational, XML, flat files, spreadsheets, messages, content repositories, Web, …)
High function: one query integrates data from multiple sources, and the capabilities of the sources as well
Extensibility: access to a wide range of data sources; wrapper development toolkit
Autonomy: non-disruptive to data sources, existing applications, and systems

Federation is the concept that a collection of resources can be viewed and manipulated as if they were a single resource while retaining their autonomy and integrity. Federation provides significant advantages:
Transparency: one single API for the application to talk to, independent of the number of back-end sources accessed.
Heterogeneity: relational, ODBC, flat files, XML, MQ, spreadsheets, and more (through wrappers and functions).
Extensibility and openness: federation can be extended to almost any data source, using the wrapper development toolkit and a development environment for functions based on WebSphere Studio.
High function: the functions of the API are available across all sources, whether or not the back-end data source has the function. The federated server compensates for missing functions. For example, a flat file source may not have sort; in this case the data is read by the federation server and the server does the sort. Unique functions of the back-end source can be made available as SQL functions, if the wrapper makes them available.
Autonomy: data sources can be federated with little or no impact on existing applications or systems.
Performance: the different phases of the optimizer have functions specifically for distributed queries. For example, where it makes sense to take advantage of functions in a back-end source, it does so.
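The flat-file sort example can be sketched in a few lines of Python. This is only an illustration of the compensation idea, not DB2 Information Integrator code; the `Source` class, the `"sort"` capability flag, and `federated_select` are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A hypothetical back-end source and the operations it can run itself."""
    name: str
    rows: list
    capabilities: set = field(default_factory=set)

    def scan(self, order_by=None):
        # A source only sorts when it advertises the "sort" capability.
        if order_by is not None:
            if "sort" not in self.capabilities:
                raise ValueError(f"{self.name} cannot sort")
            return sorted(self.rows, key=lambda r: r[order_by])
        return list(self.rows)


def federated_select(source, order_by):
    """Return ordered rows, compensating when the source cannot sort."""
    if "sort" in source.capabilities:
        return source.scan(order_by=order_by), "pushed down"
    # Compensation: read the rows unsorted, then sort at the federated server.
    rows = source.scan()
    return sorted(rows, key=lambda r: r[order_by]), "compensated"


flat_file = Source("flat_file", [{"id": 3}, {"id": 1}, {"id": 2}])
relational = Source("relational", [{"id": 3}, {"id": 1}, {"id": 2}], {"sort"})
```

Calling `federated_select(flat_file, "id")` and `federated_select(relational, "id")` returns the same ordered rows; only the place where the sort runs differs, which is what keeps the difference invisible to the application.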

Performance: Optimization of distributed queries
Federation leverages a full database engine: query processor, execution engine, catalog, client access, security, transactions
Query processing extended for federated data
Pushdown analysis: analyze how to decompose a user query
Generate an optimal query execution plan using cost estimates that include data source knowledge: database statistics, indexes, source functions, server and network capacities
Allows function compensation

Optimization and speed of federation is DB2 II's silver bullet. In blue are the additional items for federated optimization over DB2 optimization. The SQL is parsed and then rewritten to perform well for the optimizer it is aimed at, using knowledge of which SQL performs better at that source. Pushdown analysis figures out how to decompose the query. Then cost-based optimization looks at the normal statistics, but also at which indexes are available, which functions each data source can provide, and the processor speed and network speed and capacity of each of the sources. Efficient, source-specific SQL is then produced for the SQL sources, and an executable plan is produced. The query is driven over both local and distributed data, with functional compensation where the back end doesn't have the functional capability.
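As a rough illustration of how a cost estimate can drive the pushdown decision, the Python sketch below compares two plans for a filtered query on one remote table. The cost model and its constants (`network_cost_per_row`, `cpu_cost_per_row`) are assumptions made up for this example; the actual optimizer uses far richer statistics.

```python
def choose_plan(table_rows, selectivity, source_can_filter,
                network_cost_per_row=1.0, cpu_cost_per_row=0.1):
    """Pick the cheaper of two hypothetical plans for a filtered remote scan.

    table_rows: estimated cardinality of the remote table
    selectivity: estimated fraction of rows that satisfy the predicate
    source_can_filter: whether the source can evaluate the predicate itself
    """
    plans = {
        # Ship every row over the network, then filter at the federated server.
        "filter at federated server":
            table_rows * network_cost_per_row + table_rows * cpu_cost_per_row,
    }
    if source_can_filter:
        # Push the predicate down so that only matching rows are shipped.
        plans["push filter down to source"] = (
            table_rows * selectivity * network_cost_per_row
        )
    best = min(plans, key=plans.get)
    return best, plans
```

With a million-row table and a predicate that keeps 1% of the rows, pushing the filter down wins whenever the source supports it; if the source cannot filter, the only available plan is to fetch the rows and compensate at the federated server.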

Replication Architecture
This shows the IBM Replication Architecture. DB2 capture and IMS capture are log-based captures. Informix, Sybase, Oracle, and MS SQL Server use trigger-based captures. There is no capture for Teradata. Also depicted are external application captures. Examples are a third-party capture application written for IDMS by International Software Products in Toronto, Canada, which has a product called DARS, and a sample program we provide called the Data Difference Utility (DDU). DDU compares two load-ready DB2 load copies and places the differences in a staging table. Customers are using this for VSAM CDC. Captures place changed data into staging tables. This provides the flexibility for each target to be different, with different transformations, different columns, and different currency. Apply then applies the changes either to DB2 or to DB2 Information Integrator's federation engine, through which writes can happen to Informix, Sybase, Oracle, MS SQL Server, and Teradata.
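The capture, staging, and apply flow can be sketched with plain Python data structures. This is a simplified model under stated assumptions, not the product's implementation: the change-record layout (`op`, `key`, `row`) and the function names `capture` and `apply_changes` are hypothetical.

```python
def capture(change_log, staging):
    """Move changes from a source's change log into the staging table."""
    for change in change_log:
        staging.append(dict(change))
    change_log.clear()


def apply_changes(staging, target, columns, transform=None):
    """Apply staged changes to one target, keeping only `columns`.

    The staging table is left untouched so that several targets, each with
    its own column subset and transformation, can be fed from it.
    """
    for change in staging:
        key = change["key"]
        if change["op"] == "delete":
            target.pop(key, None)
            continue
        row = {c: change["row"][c] for c in columns}
        if transform is not None:
            row = transform(row)
        target[key] = row


log = [
    {"op": "insert", "key": 1, "row": {"name": "ann", "city": "Haifa"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "city": "Toronto"}},
    {"op": "delete", "key": 2},
]
staging = []
capture(log, staging)

warehouse, mart = {}, {}
apply_changes(staging, warehouse, ["name", "city"])
apply_changes(staging, mart, ["name"],
              transform=lambda r: {"name": r["name"].upper()})
```

Because both targets read from the same staging table, `warehouse` keeps both columns while `mart` keeps only an upper-cased name, which mirrors the point that each target can have different columns and transformations.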

Heterogeneous Caching Feature
Improve query performance and availability
Administrator defines a Materialized Query Table (MQT): precomputed or frequently used values; any data from the federated system
Application indicates ability to use the cache: implicit or explicit use
Developer enables cache use: if enabled, reads are handled from the cache and writes are passed through to the source; if not, reads and writes are passed through to the source
Cache refresh is managed manually or by replication
Flexible caching topologies supported

Heterogeneous caching is available to improve query performance by caching information in a materialized query table. You would do this with any frequently used or precomputed values. Refresh of the MQTs can be manual or by replication. MQTs are based on relational data, but you can use them for any federated data, as long as you store it first in a relational table (such as a DB2 table).
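The read and write routing described for the cache can be modeled in a short Python sketch. It is a toy model, assuming a dictionary stands in for the remote source and for the materialized copy; the class name `CachedNickname` and its methods are hypothetical and are not part of any DB2 API.

```python
class CachedNickname:
    """Toy model: reads may use a cached copy, writes always go to the source."""

    def __init__(self, source):
        self.source = source   # stands in for the remote data source
        self.cache = None      # stands in for the materialized query table

    def refresh(self):
        # Manual refresh; in the product, replication can also drive this.
        self.cache = dict(self.source)

    def read(self, key, use_cache):
        # Reads come from the cache only when the application enables it.
        if use_cache and self.cache is not None:
            return self.cache.get(key)
        return self.source.get(key)

    def write(self, key, value):
        # Writes are passed through to the source and do not touch the cache.
        self.source[key] = value


prices = CachedNickname({"widget": 10})
prices.refresh()
prices.write("widget", 12)
```

After the write, `prices.read("widget", use_cache=True)` still returns the cached value 10 until `refresh()` runs, while `use_cache=False` returns 12 from the source. That trade between speed and currency is why the refresh policy matters.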

Wrappers
Four important tasks:
Data modeling: map the source's data model to the relational data model (tables with rows and columns); map functions into SQL operations
Query planning: represent data source capabilities; push down as much work to the data source as is sensible; detect missing functions at the source (so the engine can compensate); supply cost and cardinality information
Connection and transaction management
Query execution and data retrieval: execute the parts of a user's query that are for a specific data source

The wrapper technology was developed at IBM's Almaden Research Center and enables adding additional sources that will be transparent and optimized. There is an SDK to make it easier to develop custom wrappers for sources we haven't provided. Wrappers act as partners to our optimizer, and we have architected a solution that delivers higher-performance optimization for all sources whose wrappers provide performance information. Wrappers define how to map one data model to another. Wrappers actually connect to the source, execute the query against it, and retrieve the information.
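To make the division of labor concrete, here is a minimal Python sketch of a wrapper-style interface and one implementation over CSV text. It is not the wrapper SDK; the `Wrapper` and `CsvWrapper` classes, their method names, and the `run` helper are hypothetical, and connection and transaction management are left out for brevity.

```python
import csv
import io
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical wrapper interface, loosely following the tasks above."""

    @abstractmethod
    def columns(self):
        """Data modeling: expose the source as named columns."""

    @abstractmethod
    def capabilities(self):
        """Query planning: operations the source can run itself."""

    @abstractmethod
    def cardinality(self):
        """Query planning: row-count estimate for the optimizer."""

    @abstractmethod
    def execute(self, predicate=None):
        """Query execution: run the pushed-down part and return rows."""


class CsvWrapper(Wrapper):
    """Maps CSV text to rows and columns; the source itself cannot filter."""

    def __init__(self, text):
        reader = csv.DictReader(io.StringIO(text))
        self._columns = list(reader.fieldnames or [])
        self._rows = [dict(row) for row in reader]

    def columns(self):
        return self._columns

    def capabilities(self):
        return set()

    def cardinality(self):
        return len(self._rows)

    def execute(self, predicate=None):
        if predicate is not None:
            raise NotImplementedError("this source cannot filter")
        return list(self._rows)


def run(wrapper, predicate):
    """Push the filter down when the wrapper allows it, else compensate."""
    if "filter" in wrapper.capabilities():
        return wrapper.execute(predicate)
    return [row for row in wrapper.execute() if predicate(row)]
```

For example, `run(CsvWrapper("id,name\n1,ann\n2,bob\n"), lambda r: r["id"] == "2")` fetches both rows and filters them in the engine, because this wrapper reports no filtering capability.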

Configuration
Configuration steps:
Wrapper: the wrapper code module itself
Server: a specific data source, with associated attributes
User mapping: information needed to connect to a specific server
Nickname: a specific data set managed by a server, mapped to rows and columns in the federated server
Defined to the system via DDL commands
GUI administration generates the DDL
Stored in the system catalog

The first step in configuring a data source is to configure the wrapper itself. One wrapper is configured for each type of data. So if you want to access two Oracle data sources, you configure one wrapper for Oracle. "Configure" in this context means using the "Create" action in the Control Center (CC) tool. Next is defining to the wrapper the servers that contain the data sources. In the case of the two Oracle data sources, if they resided on two physical servers, there would be two server entries in the Oracle wrapper. User mapping is the next step. This specifies the mapping of the user ID and password on the federated system to a user ID and password on the federated source. This step is primarily for relational data sources, although user mappings may also apply to some non-relational sources. Last is the nickname specification. A nickname is the ID used by an application to reference federated data. A nickname refers to a relational table or, for non-relational data, a table-like data object. The nickname specifies the columns, and potentially the rows, of the target data that will be visible using this ID. Once this data source information is specified to the Control Center, the Control Center generates the DDL that defines the wrapper, server, users, and nicknames to the federated server. The resulting metadata is stored in the federated server's system catalog; i.e., the SYSTABLE, SYSWRAPPER, SYSSERVER... tables are defined and populated with metadata. Now, how does the DBA get the names of the federated data objects to create these registrations? There's a great "data discovery" function to help in that effort.
Let's go to the next slide.
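The relationship between the four registered objects can be sketched as a small in-memory catalog in Python. This models only the dependencies described in the notes (one wrapper, several servers, per-user mappings, nicknames); it does not reproduce the DDL or the layout of the real system catalog, and every class, method, and object name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    wrapper: str


@dataclass(frozen=True)
class UserMapping:
    local_user: str
    server: str
    remote_user: str


@dataclass(frozen=True)
class Nickname:
    name: str
    server: str
    remote_object: str


class Catalog:
    """Toy stand-in for the federated server's catalog."""

    def __init__(self):
        self.wrappers = set()
        self.servers = {}
        self.user_mappings = {}
        self.nicknames = {}

    def create_wrapper(self, name):
        self.wrappers.add(name)

    def create_server(self, name, wrapper):
        # A server can only be registered under an existing wrapper.
        if wrapper not in self.wrappers:
            raise KeyError(f"unknown wrapper: {wrapper}")
        self.servers[name] = Server(name, wrapper)

    def create_user_mapping(self, local_user, server, remote_user):
        if server not in self.servers:
            raise KeyError(f"unknown server: {server}")
        self.user_mappings[(local_user, server)] = UserMapping(
            local_user, server, remote_user)

    def create_nickname(self, name, server, remote_object):
        if server not in self.servers:
            raise KeyError(f"unknown server: {server}")
        self.nicknames[name] = Nickname(name, server, remote_object)

    def resolve(self, nickname, local_user):
        """Follow nickname -> server -> wrapper and pick the user mapping."""
        nick = self.nicknames[nickname]
        server = self.servers[nick.server]
        mapping = self.user_mappings[(local_user, nick.server)]
        return server.wrapper, server.name, mapping.remote_user, nick.remote_object


catalog = Catalog()
catalog.create_wrapper("oracle")             # one wrapper per type of source
catalog.create_server("ora_a", "oracle")     # two Oracle servers, one wrapper
catalog.create_server("ora_b", "oracle")
catalog.create_user_mapping("jaffa", "ora_b", "scott")
catalog.create_nickname("EMPLOYEES", "ora_b", "HR.EMP")
```

An application only ever names `EMPLOYEES`; `catalog.resolve("EMPLOYEES", "jaffa")` shows how that one identifier leads to the wrapper, the server, the remote user, and the remote object.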

Administration Tools
The Control Center is the administration tool used to register new data sources. The GUI helps the administrator with each step in the process. Starting with the Control Center (left window), the first step is to register the wrapper by clicking the "Create Wrapper" action. This causes the top middle screen to come up. The administrator merely selects the type of data source from a selection list. The selection list shows all the wrappers that are installed, relational and non-relational. After the type of data source is selected, a wrapper object is created in the Control Center screen. Selecting the "Create Server" action brings up the dialog box (middle, bottom) where the information identifying the server is specified. Optional settings for the server are specified in the second tab of this control (screen at bottom right). Descriptions and hints are available for the fields in these controls. The next step is to identify the data on the servers that will be available to users. This is where Discovery provides significant help to the administrator. Let's go to the next slide to see how this works.

Discovery for Nicknames
"Create Nicknames" window: launches the customized GUI
Customized "Discover" GUI: returns nickname definitions

You can define nicknames directly, or you can have Discovery assist you. To do this directly, you need a lot of knowledge about the data source and how to define nicknames, especially for non-relational sources. The easier way is to use Discovery. Discovery can assist by getting or showing the data objects and creating the nickname definitions that are pertinent to each object. Here we see the Create Nicknames window. You can see the nicknames already defined and define more directly by clicking the Add button, or you can click the Discover button to launch the customized Discover GUI, which will show the objects and help you define the nicknames. The nickname definitions can then be seen in the Create Nicknames window after a refresh of that window.

For relational sources, you can use the GUI to discover the remote tables, upon which you can create the nicknames. For non-relational sources such as Excel spreadsheets, the GUI will bring back all the spreadsheet names, upon which you can create nicknames. For Entrez (a database of articles that are categorized into different tables for different types of articles), you can create a nickname for each table type. For Extended Search, you define a nickname for the search engines to be searched on a specific server, so a nickname might point to Google, Yahoo, and Lotus Notes; a search of that nickname would then look at all three types of sources and return a list of the documents that matched the search criteria, arranged in rank order. For XML, which is a hierarchical view with parents and children, Discovery assists in creating not only the nickname but also definitions for views on these nicknames, to accommodate the hierarchical nature of XML. So for XML we see nicknames at the server level, but also at the object level.
Future uses of this might be for user mappings and for user-defined functions (but this is not in R1).

Wrappers that support discovery: Sybase, Oracle, SQL Server, DB2, Informix, ODBC, Teradata, HMMER, Entrez, XML, Flat File, Excel, Extended Search

Replication Administration
Definitions: manage control definitions for replication; customize names and sizes of objects
Operations: start Capture, Apply, Monitor, Analyzer, and Trace; issue commands such as STOP or STATUS
Monitoring: perform static and dynamic monitoring

Replication Administration facilitates definitions and the management of operations: starting Capture and Apply, and monitoring for performance and for completion. Alerts are also available to guide administrators to problems, such as sites that are unavailable.

Application Development: Access DB2 catalogs and DB2 II federated sources
DB2 Development Center
WebSphere Studio: database development is a key component in the new WebSphere Studio Application Developer offering.
Microsoft Visual Studio .NET

Demo: query and result across SQL Server, DB2, and Excel

For more information
Articles in the IBM Systems Journal include:
Information Integration: A research agenda
Information Integration: A new generation of information technology
Data integration through database federation
XQuery: an XML query language
XTABLES: Bridging relational technology and XML
XML programming with SQL/XML and XQuery
DB2 and Web Services
Bringing together content and data management systems: Challenges and opportunities
The integration of business intelligence and knowledge management
Using flows in information integration

Summary
Information integration is a foundation for companies to build an On Demand Operating Environment, enabling them to align their IT infrastructure to business priorities.
DB2 Information Integrator provides access to diverse, distributed, and real-time data as if it were a single source, no matter where it resides.
DB2 Information Integrator will help businesses:
Shorten application development time
Improve productivity and application efficiency
Rely on IBM's proven technology and support for open standards

The whole is worth more than its constituent parts
DB2 Information Integrator helps businesses to leverage existing data assets into knowledge for the benefit of the business

Don’t forget to give us feedback
Presentation Code: A4
