Download presentation
2
Oracle’s Big Data solutions
Jean-Philippe Breysse Oracle Suisse
3
The following is intended to outline our general product direction
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
4
USE CASE 3: LOGS ANALYSIS OF SERVERS
Short Description : Daily logs analysis Issues: Find correlations on what drives to failures Log files stored as flat files
5
Oracle Technology mapped to Analytics Landscape
Oracle 12g Oracle 12g Oracle R Enterprise & Oracle Data Mining Master & Reference Structured Oracle Data Integrator Oracle Times Ten Transactions Files Oracle BI Enterprise Oracle Golden Gate Semi-structured Endeca MDEX Machine Generated Oracle NoSQL Oracle Real Time Decisions Oracle Essbase Oracle Hadoop MapReduce Oracle Endeca Information Discovery Text, Image, Video, Audio Oracle Hadoop HDFS Unstructured Data Acquire Organize Analyze Decide
6
Agenda Big Data Solution Spectrum Inside the Big Data Appliance
Big Data Applications Software Big Data Analytics Conclusions 6
7
<Insert Picture Here> Why Everyone Should Care
Big Data Why Everyone Should Care
8
Tapping into Diverse Data Sets
Big Data: Decisions based on all your data Video and Images Machine-Generated Data Social Data Documents Information Architectures Today: Decisions based on database data Transactions
9
A bit of history ... : Developed initially by Doug Cutting (Nutch - Opensource websearch engine) and Yahoo -> inspired by Google’s papers on MapReduce and GFS ( ) resulted in Apache Hadoop (2006) Amazon Dynamo (2007): distributed systems technologies Cassandra: was developed at Facebook (2008) to power their Inbox Search feature (columnar oriented distributed DB) based initially on Dynamo and Bigtable (built by Google) Voldemort: is a distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage (NoSql key value) Cloudera: . It contributes to Hadoop and related Apache projects and provides a commercial distribution of Hadoop
10
Structured Data vs. Unstructured Data
So What is Big Data Anyway? It’s a matter of perspective. Big Data is both: LARGE AND VARIABLE DATASETS that are difficult for traditional database tools to easily manage – including datasets that once seemed not important or too problematic to deal with. Big Data datasets include: Extremely large files of unstructured or semi-structured data Large and highly distributed datasets that are otherwise difficult to manage as a single unit of information NEW SET OF TECHNOLOGIES that can economically capture, store, manage, and extract value from Big Data datasets – thus facilitating better, more informed business decisions Structured Data vs. Unstructured Data Relational databases work best with structured data – data which has underlying structure (schema) and size that easily fits the specific confines of database columns and rows. Unstructured data is highly variable, lacks fixed structure, and is often too large to easily handle by RDBMS systems. Big Data is about making sense of ALL the information at our disposal – not just the much smaller amount of data that conveniently fits within the confines of a database field such as the mass quantities of variable unstructured data that typically was not considered by enterprises. This is because unstructured problem spaces require the ability to accept and manage data without advanced knowledge of its structure or format. Big Data is not just the nature of the content. It is also about the enabling technologies - like Hadoop and NoSQL - that make it possible help harness its value in ways not generally possible until quite recently. Big data is a big dynamic that seemed to appear from nowhere. But in reality, Big Data (at least in terms of Big Data content) isn't new. It has been around a long time but only just recently moved into the mainstream based upon new technologies that companies like Google and Yahoo have pioneered. Today, Big Data is getting big attention for good reason. Along with the Google/Yahoo-esque technologies that specifically enable it, Big Data is also enabled by inexpensive storage, a proliferation of sensor and data capture technology, increasing connections to information via the cloud and virtualized storage infrastructures, and other innovative software and analysis tools. The term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of less structured data: weblogs, social media, , sensors, and photographs that can be mined for useful information. But Big Data also refers to the set of new technology used to manage and extract value from it (ie, make better, more informed decisions). An alternate term for the unstructured data not typically found in relational databases (and unstructured variants such as “less structured” or “semi-structured”data) is “LOW DENSITY” data. This is the data seen by enterprises as having LOW BUSINESS VALUE relative to the enterprise. RDBMS normally handle “HIGH DENSITY” or HIGH BUSINESS VALUE data. The term has nothing to do with how close bits might be packed together! ;) (NOTE: No matter the words, the “Unstructured Data” term is imprecise for several reasons; 1) structure, while not formally defined can still be implied and 2) data with some form of structure may still be characterized as unstructured if its structure is not helpful for the desired processing task, and 3) unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.) Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, files, and unstructured text such as the body of an message, Web page, or word processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (ex: in files or documents, ...) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data (or sometimes SEMI-STRUCTURED DATA). For example, an HTML Web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements although it typically does not capture or convey the semantic meaning of tagged terms. People use unstructured data every day. Although they may not be aware, they use it for creating, storing and retrieving reports, s, spreadsheets and other types of documents. Unstructured data consists of any data stored in an unstructured format at an atomic level. That is, in the unstructured content, there is no conceptual definition and no data type definition - in textual documents, a word is simply a word. Some current technologies used for content searches on unstructured data require tagging entities such as names or applying keywords and meta tags. Therefore, human intervention is required to help make the unstructured data machine readable. People also use structured data every day. Structured data is anything that has an enforced composition to the atomic data types. Structured data is managed by technology that allows for querying and reporting against predetermined data types and understood relationships. Source: IDC Digital Universe Study, Extracting Value from Chaos, June 2011 (sponsored by EMC)
11
<Insert Picture Here>
Drive Value from Big Data Building a Big Data Platform
12
Divided Solution Spectrum
Data Variety Unstructured NoSQL Flexible Specialized Developer Centric SQL Trusted Secure Administered Distributed File Systems Transaction (Key-Value) Stores MapReduce Solutions Schema-less DBMS (DW) DBMS (OLTP) ETL Advanced Analytics Schema Acquire Organize Analyze
13
Hadoop to Oracle – Bridging the Gap
Data Variety Unstructured Hadoop MapReduce HDFS Cassandra Schema-less Oracle Loader for Hadoop RDBMS (OLTP) RDBMS (DW) Advanced Analytics ETL Schema Acquire Organize Analyze
14
Oracle Integrated Software Solution
Data Variety Unstructured Hadoop HDFS Oracle NoSQL DB Oracle Analytics: Data Mining R Spatial Graph mapreduce OBI EE Oracle (DW) Schema-less Oracle Data Integrator Oracle Loader for Hadoop Oracle (OLTP) Schema Acquire Organize Analyze
15
<Insert Picture Here>
Inside the Big Data Appliance Overview
16
Oracle Engineered Solutions
Data Variety Information Density Oracle Database (DW) (OLTP) In-DB Analytics “R” Mining Text Graph Spatial Exalytics Speed of Thought Analytics Oracle NoSQL DB HDFS Hadoop Oracle Data Integrator Oracle Loader for Hadoop Big Data Appliance Hadoop NoSQL Database Oracle Loader for hadoop Oracle Data Integrator Oracle Exadata OLTP & DW Data Mining & Oracle R Semantics Spatial Unstructured Oracle BI EE Schema Acquire Organize Analyze
17
Big Data Appliance Usage Model
Oracle Big Data Appliance Oracle Exadata Oracle Exalytics InfiniBand InfiniBand Stream Acquire Organize Analyze & Visualize
18
Why build a Hadoop Appliance?
Time to Build? Required Expertise? Cost and Difficulty Maintaining?
19
Oracle Big Data Appliance Hardware
18 Sun X4270 M2 Servers 48 GB memory per node = 864 GB memory 12 Intel cores per node = 216 cores 24 TB storage per node = 432 TB storage 40 Gb p/sec InfiniBand 10 Gb p/sec Ethernet
20
09/23/10 Big Data Appliance Cluster of industry standard servers for Hadoop and NoSQL Database Focus on Scalability and Availability at low cost InfiniBand Network Compute and Storage Redundant 40Gb/s switches IB connectivity to Exadata 18 High-performance low-cost servers acting as Hadoop nodes 24 TB Capacity per node 2 6-core CPUs per node Hadoop triple replication NoSQL Database triple replication 10GigE Network 8 10GigE ports Datacenter connectivity 20
21
Scale out by connecting racks to each other using Infiniband
Scale Out to Infinity Scale out by connecting racks to each other using Infiniband Expand up to eight racks without additional switches Scale beyond eight racks by adding an additional switch
22
Oracle Big Data Appliance Software
Oracle Enterprise Linux 5.6 Oracle Hotspot Java VM Cloudera’s Distribution including Apache Hadoop Cloudera Manager Open Source Distribution of R Oracle NoSQL Database Community Edition
23
Why Open-Source Apache Hadoop?
Fast evolution in critical features Built by the Hadoop experts in the community Practical instead of esoteric Focus on what is needed for large clusters Proven at very large scale In production at all the large consumers of Hadoop Extremely stable in those environments Well-understood by practitioners
24
Software Layout Node 1: Node 2: Node 3: Node 4 – 18:
M: Name Node, Balancer & HBase Master S: HDFS Data Node, NoSQL DB Storage Node Node 2: M: Secondary Name Node, Management, Zookeeper, MySQL Slave Node 3: M: JobTracker, MySQL Master, ODI Agent, Hive Server Node 4 – 18: S: HDFS Data Nodes, Task Tracker, HBase Region Server, NoSQL DB Storage Nodes Your MapReduce runs here!
25
<Insert Picture Here>
Big Data Application Software Acquire New Information
26
Key-Value Store Workloads
Large dynamic schema based data repositories Data capture Web applications (click-through capture) Online retail Sensor/statistics/network capture (factory automation for example) Backup services for mobile devices Data services Scalable authentication Real-time communication (MMS, SMS, routing) Personalization Social Networks Key-value Oracle NoSQL DB*, Voldemort*, Tokyo Cabinet, Redis, Riak, CitrusLeaf, GenieDB*, Amazon Dynamo*, Google LevelDB *Built on top of Berkeley DB Columnar Cassandra, Hbase, HyperTable, Google BigTable Document MongoDB, CouchDB, RavenDB Graph OrientDB, GraphDB
27
Oracle NoSQL DB A distributed, scalable key-value database
Simple Data Model Key-value pair with major+sub-key paradigm Read/insert/update/delete operations Scalability Dynamic data partitioning and distribution Optimized data access via intelligent driver High availability One or more replicas Disaster recovery through location of replicas Resilient to partition master failures No single point of failure Transparent load balancing Reads from master or replicas Driver is network topology & latency aware Elastic (Planned for Release 2) Online addition/removal of Storage Nodes Automatic data redistribution Application Application NoSQLDB Driver NoSQLDB Driver Storage Nodes Data Center A Data Center B
28
Oracle NoSQL DB Differentiation
Commercial Grade Software and Support General-purpose Reliable – Based on proven Berkeley DB JE HA Easy to install and configure Scalable throughput, bounded latency Simple Programming and Operational Model Simple Major + Sub key and Value data structure ACID transactions Configurable consistency & durability Easy Management Web-based console, API accessible Manages and Monitors: Topology; Load; Performance; Events; Alerts Completes Oracle large scale data storage offerings
29
<Insert Picture Here>
Big Data Application Software Organizing Data for Analysis
30
Oracle Loader for Hadoop Features
Load data into a partitioned or non-partitioned table Single level, composite or interval partitioned table Support for scalar datatypes of Oracle Database Load into Oracle Database 11g Release 2 Runs as a Hadoop job and supports standard options Pre-partitions and sorts data on Hadoop Online and offline load modes
31
Oracle Loader for Hadoop
Input 1 MAP Shuffle /Sort MAP Shuffle /Sort Oracle Loader for Hadoop MAP Reduce Reduce MAP Shuffle /Sort MAP Reduce MAP Reduce MAP Reduce Reduce Reduce MAP MAP MAP Shuffle /Sort MAP Reduce MAP Reduce MAP Shuffle /Sort Reduce MAP MAP MAP Reduce MAP MAP Reduce Input 2
32
Oracle Loader for Hadoop: Online Option
Read target table metadata from the database Perform partitioning, sorting, and data conversion Connect to the database from reducer nodes, load into database partitions in parallel Oracle Loader for Hadoop MAP Shuffle /Sort Reduce Reduce MAP Shuffle /Sort Reduce MAP Reduce MAP Reduce
33
Oracle Loader for Hadoop: Offline Option
Read target table metadata from the database Perform partitioning, sorting, and data conversion Write from reducer nodes to Oracle Data Pump files Oracle Loader for Hadoop MAP Shuffle /Sort Import into the database in parallel using external table mechanism Reduce Reduce MAP Shuffle /Sort Reduce MAP Reduce MAP Reduce
34
Oracle Loader for Hadoop Advantages
Offload database server processing to Hadoop: Convert input data to final database format Compute table partition for row Sort rows by primary key within a table partition Generate binary datapump files Balance partition groups across reducers
35
Selection Output Option for Use Case
Oracle Loader for Hadoop Output Option Use Case Characteristics Online load with JDBC The simplest use case for non partitioned tables Online load with Direct Path Fast online load for partitioned tables Offline load with datapump files Fastest load method for external tables On Oracle Big Data Appliance Direct HDFS Leave data on HDFS Parallel access from database Import into database when needed Direct HDFS: Access data on HDFS through the external table mechanism Benefits Data on HDFS can be queried from the database Import into the database as needed
36
Automate Usage of Oracle Loader for Hadoop
Oracle Data Integrator (ODI) ODI has knowledge modules to Generate data transformation code to run on Hive/Hadoop Invoke Oracle Loader for Hadoop Use the drag-and-drop interface in ODI to Include invocation of Oracle Loader for Hadoop in any ODI packaged flow
38
<Insert Picture Here>
Big Data Analytics Real Time Analytics Platform
39
R Statistical Programming Language
Open source language and environment Used for statistical computing and graphics Strength in easily producing publication-quality plots Highly extensible with open source community R packages
40
<Insert Picture Here>
Drive Value from Big Data Conclusions
41
Big Data Appliance Big Data for the Enterprise
Optimized and Complete Everything you need to store and integrate your lower information density data Integrated with Oracle Exadata Analyze all your data Easy to Deploy Risk Free, Quick Installation and Setup Single Vendor Support Full Oracle support for the entire system and software set
42
Oracle Integrated Solution Stack for Big Data
ACQUIRE Oracle NoSQL Database HDFS Enterprise Applications ANALYZE In-Database Analytics Data Warehouse ORGANIZE Hadoop (MapReduce) Oracle Loader for Hadoop Oracle Data Integrator DECIDE Oracle Analytic Applications
43
Oracle: Big Data for the Enterprise
The most comprehensive solution Includes everything needed to acquire, organize and analyze all your data Optimized for Extreme Analytics Deepest analytics portfolio with access to all data Engineered to Work Together Eliminate deployment risk and support risk Enterprise Ready Deliver extreme performance and scalability
44
Questions 44
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.