Information Integration Instructor: Pankaj Mehra Teaching Assistant: Raghav Gautam Lec. 9 May 13, 2010 ISM 158.

1 Information Integration Instructor: Pankaj Mehra Teaching Assistant: Raghav Gautam Lec. 9 May 13, 2010 ISM 158

2 Enterprise Information

3 Centralized versus Distributed?
Distributed systems occur naturally
State of the art does not allow complex queries or deep analysis against distributed information
Centralization may also be favored due to lower costs of infrastructure, license and labor, as well as its ability to enforce tighter integrity constraints and other information management policies
Ultimately, the decision needs to take into account issues of ownership and control
–Technology considerations are often secondary; even so, rational rules for resolving these considerations exist, as described in the Distributed Computing Economics paper

4 Contrasting Business & Technical Information
[Diagram contrasting the Business domain and the Technical domain. Labels as extracted: Metadata scaling; Data bandwidth scaling; SQL schema & query; XML or WS schema & query; File schema & query; Centralized metadata; Real-time information; Ad hoc query; Inconsistent information; Pivoting; Data mining; Search federation; Structured sources; Distributed archives; Distributed complex controls; Central control; Central archive; Stable schemata; Schema evolution; Unstructured sources; Heavy data processing; Simple metadata fusion; Complex metadata; Simpler data fusion; ETL; Streaming A/V; Visualization; Dashboards; Steering; Deep linguistics]

5 The Guiding Principles
It is a bad idea to address the following as afterthoughts
–Scale
–Availability
–Integrity
–Privacy and security
–Compliance / auditability
–Retention requirements
–Business value
–Information quality
The ability to embed function close to data is fundamental to scalable information processing
In order to deliver the best performance/$, systems tend to scale out from technology sweet spot of the day
Redundancy configured in from the start, as well as mechanisms for early detection and isolation of faults
Optimize availability by optimizing recovery
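
The point about embedding function close to data can be illustrated with a minimal sketch. It is not from the lecture: the `Shard` class, the `size_kb` field, and the helper names are all hypothetical. Each shard runs a small function over its own rows and returns only a (sum, count) pair, so the coordinator never pulls the raw records.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Shard:
    """One node holding a slice of the records (hypothetical structure)."""
    rows: List[dict]

    def run_local(
        self, fn: Callable[[List[dict]], Tuple[float, int]]
    ) -> Tuple[float, int]:
        # The function travels to the data; only its small result travels back.
        return fn(self.rows)


def partial_sum_count(rows: List[dict]) -> Tuple[float, int]:
    """Per-shard partial aggregate over the assumed 'size_kb' field."""
    values = [r["size_kb"] for r in rows]
    return (float(sum(values)), len(values))


def mean_size(shards: List[Shard]) -> float:
    """Combine per-shard (sum, count) pairs into a global mean."""
    partials = [s.run_local(partial_sum_count) for s in shards]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else 0.0


shards = [
    Shard([{"size_kb": 10}, {"size_kb": 30}]),
    Shard([{"size_kb": 20}]),
]
print(mean_size(shards))  # 20.0
```

Adding a shard adds both storage and the compute that scans it, which is the scale-out property the slide refers to.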

6 Scalable Content Processing
Enterprise information is complex
Diversity of information sources and formats
–Entail complex integration and processing flows
–Metadata generation and indexing
–Content indexing
Protection and security
[Diagram labels: storage; data; content connectors; scalable repository; scalable processing; e.g. JCR API]

7 Smart Cells
Scalable distributed system of self-contained, all-inclusive data repositories
Principles
–Scale-out
–Federation
–Intelligence close to data
–Pluggable platforms supporting proprietary and 3rd-party storage services
Example
–Platforms for Information Lifecycle Management services
–Scale-out architecture used under cloud information services
[Diagram labels: Smart Cell (repeated many times); Smart Query Fabric; Storage: Block, File, Object & Fragment; Content indexing; Attribute indexing; Supported protocols and APIs]
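
As a rough illustration of federation with intelligence close to data, the sketch below models each cell as a repository with its own content and attribute indexes, and a fabric that fans a query out to every cell. The class names `SmartCell` and `QueryFabric`, their methods, and the sample objects are assumptions made for this example; the slides do not describe an actual API.

```python
from typing import Dict, Iterable, List, Set, Tuple


class SmartCell:
    """Hypothetical self-contained repository with its own indexes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, str] = {}                      # id -> content
        self.content_index: Dict[str, Set[str]] = {}           # word -> ids
        self.attr_index: Dict[Tuple[str, str], Set[str]] = {}  # (key, value) -> ids

    def put(self, oid: str, content: str, attrs: Dict[str, str]) -> None:
        self.objects[oid] = content
        for word in content.lower().split():
            self.content_index.setdefault(word, set()).add(oid)
        for item in attrs.items():
            self.attr_index.setdefault(item, set()).add(oid)

    def query(self, word: str, attrs: Dict[str, str]) -> Set[str]:
        # Evaluated inside the cell: intersect content hits with attribute hits.
        hits = set(self.content_index.get(word.lower(), set()))
        for item in attrs.items():
            hits &= self.attr_index.get(item, set())
        return hits


class QueryFabric:
    """Hypothetical fabric: scatter a query to all cells, gather the ids."""

    def __init__(self, cells: Iterable[SmartCell]) -> None:
        self.cells = list(cells)

    def query(self, word: str, attrs: Dict[str, str]) -> List[str]:
        results: List[str] = []
        for cell in self.cells:
            results.extend(f"{cell.name}/{oid}" for oid in cell.query(word, attrs))
        return sorted(results)


cell_a = SmartCell("cell-a")
cell_a.put("1", "quarterly sales report", {"type": "doc"})
cell_a.put("2", "sales call recording", {"type": "audio"})
cell_b = SmartCell("cell-b")
cell_b.put("1", "regional sales forecast", {"type": "doc"})

fabric = QueryFabric([cell_a, cell_b])
print(fabric.query("sales", {"type": "doc"}))  # ['cell-a/1', 'cell-b/1']
```

The fabric here visits cells one after another for simplicity; a real scale-out system would presumably issue the per-cell queries in parallel.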

8 Considerations in Distributed Information Management
Information is distributed across heterogeneous sources and has varied provenance
→ Integration
Information management requires information about information
→ Metadata
Useful information is timely and findable
→ Real-time integration and caching
→ Indexing
→ Semantic analysis
→ Context
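
One simple way to trade timeliness against the cost of reaching a remote source is a cache with a time-to-live. The sketch below is an assumption-laden illustration, not something the slides specify: `TTLCache`, `fetch_from_source`, and the 5-second TTL are all hypothetical, and the clock is injectable so that expiry can be shown without waiting.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    """Serve a cached value until it is older than ttl_seconds, then refetch."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (fetched_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: no trip to the source
        value = self._fetch(key)       # missing or stale: go to the source
        self._entries[key] = (now, value)
        return value


calls: List[str] = []


def fetch_from_source(key: str) -> str:
    """Stand-in for a remote lookup; records each call for the demo."""
    calls.append(key)
    return f"record:{key}"


fake_now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(fetch_from_source, ttl_seconds=5.0, clock=lambda: fake_now[0])

first = cache.get("p1")    # miss: fetches from the source
second = cache.get("p1")   # hit: served from the cache
fake_now[0] = 6.0          # advance past the TTL
third = cache.get("p1")    # stale: fetches again
print(len(calls))  # 2
```

A shorter TTL gives fresher answers at the price of more source traffic; the right value depends on how quickly the underlying information changes.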

9 Dimensions of Integration

10 Ecosystem of integration products
Metadata
–Determines information richness
Service Orientation
–Determines protocol richness
Future
–Integration as syndication
–Integration aaS
[Chart with axes Metadata and Service-orientedness. Product categories and example vendors as extracted:]
SQL-based EII: SAP, Oracle, Composite
XML-based EII: BEA LiquidData, Mark Logic
JSR 170 ECI: Day
WS-based SOA: Microsoft, IBM
RSS-based: NewsGator
Pure EAI: Tibco, SAG
Uniform access: MOSS, Attivio

11 Points for Discussion in class
Consider a healthcare patient information scenario.
–Is it mainly transactional or mainly analytic?
–Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)?
Consider a scenario in which a company wants to drill down into the root causes of customer complaints.
–Again, centralized or distributed?
  Identifying the root cause
  Tracking the problem
–Would real-time integration become a requirement?

12 Points to ponder at home
Pros of integration
–Connecting the dots
–Single view of …
–Quality control over
  Inconsistency
  Staleness
  Gaps
Cons of integration
–Loss of context
–Often, read only
–Cost
–Duplication
–Scale
–Losing battle?
–Risk
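
The three quality problems listed under the pros (inconsistency, staleness, gaps) can be made concrete with a small sketch that merges per-source records into a single view and reports what it finds. Everything here is hypothetical: the function name, the source names `crm` and `billing`, the sample values, the 30-day staleness threshold, and the "newest source wins" rule for resolving conflicts are assumptions chosen for illustration.

```python
from datetime import date
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def merge_with_checks(
    sources: Dict[str, Record],
    updated: Dict[str, date],
    today: date,
    max_age_days: int = 30,
) -> Dict[str, object]:
    """Build one view from several sources and list quality issues found.

    Assumes every source name in `sources` also appears in `updated`.
    """
    issues: List[str] = []
    view: Record = {}
    fields = sorted({f for rec in sources.values() for f in rec})
    for field in fields:
        # Values actually present for this field, keyed by source name.
        values = {
            name: rec[field]
            for name, rec in sources.items()
            if rec.get(field) is not None
        }
        distinct = set(values.values())
        if not distinct:
            issues.append(f"gap: {field}")
            view[field] = None
        elif len(distinct) > 1:
            issues.append(f"inconsistent: {field}")
            # Assumed resolution rule: trust the most recently updated source.
            newest = max(values, key=lambda name: updated[name])
            view[field] = values[newest]
        else:
            view[field] = distinct.pop()
    for name, when in sorted(updated.items()):
        if (today - when).days > max_age_days:
            issues.append(f"stale: {name}")
    return {"view": view, "issues": issues}


sources = {
    "crm": {"name": "A. Smith", "email": "a@example.com", "phone": None},
    "billing": {"name": "A. Smith", "email": "asmith@example.com", "phone": None},
}
updated = {"crm": date(2010, 5, 1), "billing": date(2010, 1, 15)}

result = merge_with_checks(sources, updated, today=date(2010, 5, 13))
print(result["issues"])
# ['inconsistent: email', 'gap: phone', 'stale: billing']
```

Note that the merged view keeps only the winning value for each field, which is one small instance of the "loss of context" listed among the cons.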

13 Where to learn more Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010

14 Upcoming guest lectures in May
Dr. V. Galotra, Oracle
–SOA Deep Dive
Rahul Nim, Efficient Frontier
–Online marketing

15 Questions?

16 NEWS PRESENTATION
