
1. Definition: Big data applies to information that can't be processed or analyzed using traditional processes or tools. Case study: Facebook.





11 1. Definition: Big data applies to information that can't be processed or analyzed using traditional processes or tools. Case study: Facebook.

12 Defining Big Data - The 3Vs

13 1.1 Data Volume

14 Volume is the primary attribute of big data. The volume of data being stored today is exploding: we store everything, including environmental data, financial data, medical data, surveillance data, and the list goes on and on.

15 1.1 Data Volume Organizations are facing massive volumes of data. Those that don't know how to manage this data are overwhelmed by it. But with the right technology platform, the opportunity exists to analyze almost all of that data to gain a better understanding of your business, your customers, and the marketplace.

16 1.2 Data Variety

17 Data has become more and more complex: it includes not only traditional relational data but also raw, semi-structured, and unstructured data from web pages, web log files (including click-stream data), search indexes, social media, forums, e-mail, documents, sensor data from active and passive systems, and so on.

18 1.2 Data Variety An organization's success will rely on its ability to draw insights from the various kinds of data available to it, both traditional and nontraditional. To capitalize on the big data opportunity, enterprises must be able to analyze all types of data, both relational and non-relational: text, sensor data, audio, video, transactional, and more.

19 1.3 Data Velocity

20 Velocity is how quickly data arrives and is stored, and its associated rate of retrieval. Enterprises are dealing with petabytes of data instead of terabytes, and the growth of RFID sensors and other information streams makes it impossible for traditional systems to keep up.

21 1.3 Data Velocity Sometimes, getting an edge over your competition can mean identifying a trend, problem, or opportunity only seconds, or even microseconds, before someone else. In addition, more and more of the data being produced today has a very short shelf life, so organizations must be able to analyze this data in near real time if they hope to find insights in it.

22 Review: the 5Vs of Big Data (the 3Vs above plus Veracity and Value)


24 2. Hadoop Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

25 2. Hadoop
Framework. In this case, it means that everything you need to develop and run software applications is provided: programs, connections, and so on.
Massive storage. The Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.
Processing power. Hadoop concurrently processes large amounts of data using multiple low-cost computers for fast results.
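The processing model behind Hadoop is MapReduce: a map step emits key-value pairs from each input split, and a reduce step aggregates all values per key. A minimal single-process sketch of that idea (plain Python, not the actual Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs big storage", "hadoop stores big data"]
print(reduce_phase(map_phase(docs)))
```

In a real cluster the map and reduce steps run in parallel on many nodes, with the framework shuffling pairs by key between them; the sketch only shows the data flow.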


27 2.1 Benefits of Hadoop
Computing power. Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.
Flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images, and videos.
Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.

28 2.1 Benefits of Hadoop
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail. Hadoop also automatically stores multiple copies of all data.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

29 2.2 What is Hadoop used for? Going beyond its original goal of searching millions (or billions) of web pages and returning relevant results, many organizations are looking to Hadoop as their next big data platform. Popular uses today include:
Low-cost storage and active data archive
Staging area for a data warehouse and analytics store
Data lake
Sandbox for discovery and analysis
Recommendation systems

30 Now, switch to Mr. Tri for his presentation.

31 3. Key-Value Pair Databases By far, the simplest of the NoSQL databases are those employing the key-value pair (KVP) model. KVP databases do not require a schema (unlike RDBMSs) and offer great flexibility and scalability. KVP databases do not offer ACID (Atomicity, Consistency, Isolation, Durability) capability, and require implementers to think about data placement, replication, and fault tolerance, as these are not expressly controlled by the technology itself. KVP databases are not typed; as a result, most data is stored as strings.

32 3. Key-Value Pair Databases (cont)

33 3. Key-Value Pair Databases (cont) As the number of users increases, keeping track of precise keys and related values can be challenging. If you need to keep track of the opinions of millions of users, the number of key-value pairs associated with them can increase exponentially. You might need some additional help organizing data in a key-value database. Most offer the capability to aggregate keys (and their related values) into a collection. Collections can consist of any number of key-value pairs.
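The KVP model above (untyped values grouped into collections) can be sketched with plain Python dictionaries. This is a toy illustration of the data model, not the API of any particular KVP database:

```python
class KVStore:
    """Toy key-value store with named collections grouping related pairs."""
    def __init__(self):
        self._collections = {}

    def put(self, collection, key, value):
        # Values are untyped: like many KVP stores, we hold whatever we are given.
        self._collections.setdefault(collection, {})[key] = value

    def get(self, collection, key, default=None):
        return self._collections.get(collection, {}).get(key, default)

store = KVStore()
store.put("opinions", "user:1001", "likes big data")
store.put("opinions", "user:1002", "prefers SQL")
print(store.get("opinions", "user:1001"))
```

A real KVP database adds what the sketch leaves out: persistence, replication, and distribution of keys across nodes.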

34 3. Key-Value Pair Databases (cont) One widely used open-source key-value pair database is called Riak (http://wiki.basho.com). It is developed and supported by a company called Basho Technologies (www.basho.com) and is made available under the Apache Software License v2.0.

35 Riak Key-Value Pair Database Riak is a very fast and scalable implementation of a key-value database. Riak supports a high-volume environment with fast-changing data because it is lightweight. Riak is particularly effective at real-time analysis of trading in financial services.

36 Riak Key-Value Pair Database (cont) Riak uses "buckets" as an organizing mechanism for collections of keys and values. Riak implementations are clusters of physical or virtual nodes arranged in a peer-to-peer fashion. No master node exists, so the cluster is resilient and highly scalable. All data and operations are distributed across the cluster. Larger clusters (with more nodes) perform better and faster than clusters with fewer nodes. Communication in the cluster is implemented via a special protocol called Gossip. Gossip stores status information about the cluster and shares information about buckets.


38 Riak Key-Value Pair Database (cont) Riak has many features:
Parallel processing. Using MapReduce, Riak supports a capability to decompose and recompose queries across the cluster for real-time analysis and computation.
Search. Buckets can be indexed for rapid resolution of values to keys.
Secondary indexes. Developers can tag values with one or more key field values. The application can then query the index and return a list of matching keys. This can be very useful in big data implementations.

39 Riak Key-Value Pair Database (cont) Riak implementations are best suited for:
User data for social networks, communities, or gaming
High-volume, media-rich data gathering and storage
Caching layers for connecting RDBMS and NoSQL databases
Mobile applications requiring flexibility and dependability

40 4. Document Databases There are two kinds of document databases. One is often described as a repository for full document-style content (Word files, complete web pages, and so on). The other is a database for storing document components for permanent storage as a static entity or for dynamic assembly of the parts of a document. For big data implementations, both styles are important. The structure of the documents and their parts is provided by JavaScript Object Notation (JSON) and/or Binary JSON (BSON). Document databases are most useful when you have to produce a lot of reports that need to be dynamically assembled from elements that change frequently.

41 4. Document Databases (cont) JSON is a data-interchange format based on a subset of the JavaScript programming language. Although part of a programming language, it is textual in nature and very easy to read and write. Two basic structures exist in JSON, and they are supported by many modern programming languages:
The first basic structure is a collection of name/value pairs, represented programmatically as objects, records, keyed lists, dictionaries, hash tables, and so on.
The second basic structure is an ordered list of values, represented programmatically as arrays, lists, or sequences.
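Both JSON structures are easy to see with Python's standard `json` module: a JSON object maps to a dict (name/value pairs) and a JSON array maps to a list (ordered values). The field names below are made up for illustration:

```python
import json

# A JSON object (name/value pairs) containing a JSON array (ordered list).
text = '{"name": "Riak", "tags": ["kvp", "nosql"], "nodes": 5}'

record = json.loads(text)   # object -> Python dict
print(record["tags"][1])    # array  -> Python list

# The textual form round-trips losslessly through dumps/loads.
assert json.loads(json.dumps(record)) == record
```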

42 4. Document Databases: MongoDB MongoDB (www.mongodb.com) is the project name for the "hu(mongo)us database" system. It is maintained by a company called 10gen (www.10gen.com) as open source. MongoDB is growing in popularity and may be a good choice for the data store supporting your big data implementation. MongoDB is composed of databases containing "collections." A collection is composed of "documents," and each document is composed of fields.

43 4. Document Databases: MongoDB (cont) Just as in relational databases, you can index a collection. Doing so increases the performance of data lookup. Unlike other databases, however, MongoDB returns something called a "cursor," which serves as a pointer to the data. This is a very useful capability because it offers the option of counting or classifying the data without extracting it. Natively, MongoDB supports BSON.
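The cursor idea (iterate or count matches without copying documents out) can be mimicked with a Python generator. This is a conceptual sketch, not MongoDB's actual query API, and the documents are invented for the example:

```python
def find(documents, **criteria):
    """Return a lazy cursor (generator) over documents matching all criteria."""
    return (d for d in documents
            if all(d.get(k) == v for k, v in criteria.items()))

docs = [
    {"type": "post", "likes": 3},
    {"type": "post", "likes": 9},
    {"type": "user"},
]

cursor = find(docs, type="post")
# Counting walks the cursor; no matching document is ever materialized or copied.
print(sum(1 for _ in cursor))
```

Like a real database cursor, the generator is exhausted after one pass; you issue the query again to re-iterate.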

44 4. Document Databases: MongoDB (cont) MongoDB is also an ecosystem consisting of the following elements:
A grid-based file system (GridFS), enabling the storage of large objects by dividing them among multiple documents.
A sharding service that distributes a single database across a cluster of servers in a single data center or in multiple data centers.
MapReduce to support analytics and aggregation of different collections/documents.
A querying service that supports distributed queries and full-text search.

45 4. Document Databases: MongoDB (cont) Effective MongoDB implementations include:
High-volume content management
Social networking
Archiving
Real-time analytics

46 4. Document Databases: CouchDB Like MongoDB, CouchDB is open source. It is maintained by the Apache Software Foundation (www.apache.org) and is made available under the Apache License v2.0. CouchDB databases are composed of documents consisting of fields and attachments. The advantage of CouchDB over relational databases is that the data is packaged and ready for manipulation or storage rather than scattered across rows and tables.

47 4. Document Databases: CouchDB (cont) CouchDB is also an ecosystem with the following capabilities:
Compaction. The databases are compressed to eliminate wasted space when a certain level of emptiness is reached. This helps performance and efficiency for persistence.
View model. A mechanism for filtering, organizing, and reporting on data utilizing a set of definitions that are stored as documents in the database. You find a one-to-many relationship of databases to views, so you can create many different ways of representing the data you have.
Replication and distributed services. Document storage is designed to provide bidirectional replication. Partial replicas can be maintained to support criteria-based distribution or migration to devices with limited connectivity. Native replication is peer-based, but you can implement master/slave, master/master, and other replication modalities.
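A view in this sense is essentially a map function that turns each document into zero or more (key, value) rows. The sketch below shows the concept in plain Python (CouchDB views are actually written in JavaScript and stored in design documents; the field names here are invented):

```python
def view_by_author(documents):
    """A 'view': map each document to (key, value) rows; non-matching docs emit nothing."""
    for doc in documents:
        if "author" in doc:
            yield (doc["author"], doc["title"])

docs = [
    {"author": "an",  "title": "Intro"},
    {"author": "tri", "title": "NoSQL"},
    {"note": "no author field"},   # silently filtered out by the view
]

rows = sorted(view_by_author(docs))
print(rows)
```

Because many views can be defined over one database, the same documents can be reported in many different shapes, which is the one-to-many relationship described above.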

48 4. Document Databases: CouchDB (cont) Effective CouchDB implementations include:
High-volume content management
Scaling from smartphone to data center
Applications with limited or slow network connectivity

49 5. Columnar Databases Also known as column-oriented databases, these store data in columns rather than rows. For example, the same records stored in memory:
Row layout: US,Alpha,3000; US,Beta,1250; JP,Alpha,700; UK,Alpha,450;
Column layout: US,US,JP,UK; Alpha,Beta,Alpha,Alpha; 3000,1250,700,450;

50 5. Columnar Databases (cont) Versus row-oriented databases:
Efficient hard-disk access
Good compression
Updates and deletes are hard
With big data: fast analysis with a limited set of columns
Examples: HBase, C-Store, Cassandra, ...
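The layout difference is easy to demonstrate: in a column layout, aggregating one attribute touches a single contiguous list, while a row layout must walk every field of every record. A small sketch using the example data from the slides:

```python
# Row layout: one tuple per record.
rows = [("US", "Alpha", 3000), ("US", "Beta", 1250),
        ("JP", "Alpha", 700), ("UK", "Alpha", 450)]

# Column layout: one contiguous list per attribute.
country = ["US", "US", "JP", "UK"]
product = ["Alpha", "Beta", "Alpha", "Alpha"]
amount  = [3000, 1250, 700, 450]

# Summing amounts in the column layout reads only the amount column...
total_col = sum(amount)
# ...while the row layout scans every record and picks out one field.
total_row = sum(r[2] for r in rows)

print(total_col, total_row)
```

Homogeneous columns also compress well (long runs of similar values), which is one reason columnar stores suit analytical workloads.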

51 Cassandra

52 Cassandra (cont)
Open source
Peer-to-peer architecture
Elastic scalability
High availability and fault tolerance
High performance
Column-oriented
Tunable consistency
Schema-free

53 6. Graph Databases Graph databases use a graph structure for semantic queries. Structure:
Nodes: represent entities
Properties: information related to a node, stored as key-value pairs
Edges: lines connecting nodes
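The three ingredients (nodes with key-value properties, and labeled edges between them) can be sketched with plain Python structures. The node names and edge labels below are made up for illustration, loosely following the property-graph style used by databases like Neo4j:

```python
# Nodes: entity id -> properties (key-value pairs).
nodes = {
    "alice": {"kind": "person"},
    "bob":   {"kind": "person"},
    "acme":  {"kind": "company"},
}

# Edges: (source, label, destination) triples connecting nodes.
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]

def neighbors(node, label=None):
    """Follow outgoing edges from a node, optionally filtered by edge label."""
    return [dst for src, lbl, dst in edges
            if src == node and (label is None or lbl == label)]

print(neighbors("alice", "WORKS_AT"))
```

Queries like "who does alice know" become edge traversals, which is why graphs handle relationship-heavy questions (fraud rings, recommendations) more naturally than join-heavy SQL.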

54 6. Graph Databases (cont) Common use cases:
Fraud detection
Real-time recommendation engines
Master data management (MDM)
Network and IT operations
Identity and access management (IAM)
Graph-based search

55 6. Graph Databases (cont) Drawbacks:
Harder to do summing and max queries efficiently (counting queries are not harder)
A new query language to learn
The technology is relatively new
Example databases: Neo4j, FlockDB, GraphDB, ...


