Presentation on theme: "1. Big Data A broad term for data sets so large or complex that traditional data processing applications are inadequate. 2."— Presentation transcript:

1 1

2 Big Data A broad term for data sets so large or complex that traditional data processing applications are inadequate. 2

3 Examples Walmart: 10^6 transactions per hour, all in databases of more than 2.5 petabytes. Large Hadron Collider: 150 million sensors deliver data 40 million times per second. Amazon: millions of sales per day; its three largest Linux databases are 7.8 TB, 18.5 TB, and 24.7 TB. 3

4 Database Systems and Big Data RDBMSs have trouble handling big data. Generally, software running on tens, hundreds, or thousands of servers is required. We are generally talking about data accumulation and analysis, not transaction processing. 4

5 Exabytes! (10^18 bytes) 5

6 MapReduce Introduced in a 2004 paper from Google. The Map function processes a key/value pair to generate a set of intermediate key/value pairs. The Reduce function merges all intermediate values associated with the same intermediate key. Hadoop is an open-source implementation of MapReduce. 6

7 MapReduce Example (diagram of the Map and Reduce phases) 7
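To make the example on this slide concrete, here is a minimal, single-process word-count sketch of the MapReduce pattern in Python. The map_fn and reduce_fn names and the in-memory shuffle are illustrative assumptions, not Hadoop's actual API.

```python
from collections import defaultdict

# Hypothetical map function: emit (word, 1) for every word in a document.
def map_fn(doc_id, text):
    for word in text.lower().split():
        yield word, 1

# Hypothetical reduce function: merge all counts emitted for one word.
def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(documents, map_fn, reduce_fn):
    # "Shuffle" step: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce step: one call per distinct intermediate key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

if __name__ == "__main__":
    docs = {"d1": "big data big deal", "d2": "big data needs big clusters"}
    print(mapreduce(docs, map_fn, reduce_fn))
    # {'big': 4, 'data': 2, 'deal': 1, 'needs': 1, 'clusters': 1}
```

In a real cluster the shuffle is distributed across machines, but the map/group-by-key/reduce structure is the same.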

8 MapReduce Execution 8

9 Example Applications 9

10 BASE—an alternative to ACID? BASE is a new approach: Basically Available, Soft state, Eventually consistent. It changes the fundamental approach. The ACID approach frees applications from concern about partial transaction completion: a transaction is done completely or not at all. BASE returns control to the application quickly, but not all operations may be complete. BASE is used in the big data realm. 10
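To make "eventually consistent" concrete, here is a minimal sketch assuming two in-memory replicas and a separate propagation step; the Replica class and propagate helper are hypothetical illustrations, not any particular database's API.

```python
class Replica:
    """One copy of the data; may lag behind other replicas (soft state)."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write_local(self, key, value):
        # Returns immediately: the caller gets control back right away (BASE),
        # even though other replicas have not yet seen the update.
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def propagate(source, targets):
    # Background convergence step: eventually every replica sees every write.
    for replica in targets:
        replica.data.update(source.data)

if __name__ == "__main__":
    a, b = Replica("a"), Replica("b")
    a.write_local("user:1", "Alice")
    print(b.read("user:1"))   # None: replica b is temporarily stale
    propagate(a, [b])
    print(b.read("user:1"))   # 'Alice': the replicas have converged
```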

11 The CAP Theorem It is impossible for a distributed computer system to simultaneously guarantee all three of: Consistency: the client perceives that a set of operations has occurred all at once. Availability: every operation must terminate in an intended response. Partition tolerance: operations will complete even if individual components are unavailable. This was originally Brewer's Conjecture; it has since been proved, so it is now called the CAP theorem. 11

12 Forfeit Availability Distributed databases, distributed locking, majority protocols. 12

13 Forfeit Consistency DNS, web caches. 13

14 Forfeit Partition Tolerance Single-site databases, LDAP. 14

15 Optimistic Replication Underlies the BASE approach: replicas are allowed to diverge but eventually converge. Operations: 1. Operation submission: users submit operations from independent sites. 2. Propagation: each site shares the operations it knows about with other sites. 3. Scheduling: each site decides on an order for the operations it knows about. 4. Conflict resolution: if there are conflicts among operations at a site, the sequence is modified. 5. Commitment: sites agree on a final schedule and conflict resolution result, and changes are made permanent. Note that a reliable message queue for transactions, together with idempotent transactions, is often employed. 15
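As a small illustration of why idempotence matters here, the following sketch (the Replica class, apply_once helper, and operation IDs are assumptions for illustration) shows a replica that can safely replay operations delivered more than once by a reliable message queue.

```python
class Replica:
    """Replica that applies each operation at most once (idempotent replay)."""
    def __init__(self):
        self.state = {}
        self.applied_ids = set()   # remembers which operation IDs we have seen

    def apply_once(self, op):
        # op is a dict like {"id": "op-42", "key": "x", "value": 7}
        if op["id"] in self.applied_ids:
            return                 # duplicate delivery from the queue: ignore it
        self.state[op["key"]] = op["value"]
        self.applied_ids.add(op["id"])

if __name__ == "__main__":
    r = Replica()
    op = {"id": "op-1", "key": "cart:9", "value": "book"}
    r.apply_once(op)
    r.apply_once(op)               # redelivered by the message queue: no effect
    print(r.state)                 # {'cart:9': 'book'}
```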

16 Example: CVS CVS does version control. Users edit local versions of files; they can pull updates from the server or push out updates they think are ready. Changes are applied in the order received. If conflicts are detected, they are flagged for manual repair by the users. 16
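The following sketch, assuming a simple line-level three-way comparison (not CVS's real merge algorithm), illustrates how a conflict is detected and flagged for manual repair when two users change the same line differently.

```python
def merge_line(base, local, remote):
    """Three-way merge of one line; returns (merged_line, conflict_flag)."""
    if local == remote:
        return local, False          # both sides agree (or neither changed it)
    if local == base:
        return remote, False         # only the remote side changed it
    if remote == base:
        return local, False          # only the local side changed it
    # Both sides changed the same line in different ways: flag a conflict.
    return f"<<<<<<< local\n{local}\n=======\n{remote}\n>>>>>>> remote", True

if __name__ == "__main__":
    merged, conflict = merge_line("x = 1", "x = 2", "x = 3")
    print(conflict)   # True: a human has to resolve this one
    print(merged)
```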

17 Implications Applications must ensure that delayed updates do not impair the user's view of correctness. Testing in a more limited environment can mask problems that appear in a larger production environment. Validity constraints can become sensitive to the order of change operations, causing reconciliation problems. 17
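As a small illustration of an order-sensitive validity constraint (the account scenario and the non-negative-balance rule are assumptions for illustration), the same two operations succeed in one order and violate the constraint in the other, which is what makes reconciliation hard.

```python
def apply_ops(balance, ops):
    """Apply (label, delta) operations, enforcing a non-negative balance."""
    for label, delta in ops:
        if balance + delta < 0:
            raise ValueError(f"constraint violated while applying {label}")
        balance += delta
    return balance

if __name__ == "__main__":
    deposit = ("deposit", +100)
    withdrawal = ("withdrawal", -80)
    print(apply_ops(0, [deposit, withdrawal]))    # 20: this order is fine
    try:
        apply_ops(0, [withdrawal, deposit])       # same operations, other order
    except ValueError as e:
        print(e)   # constraint violated while applying withdrawal
```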

18 A Thought There is little work on the continuum between ACID and BASE; there could be a PhD dissertation in this area. 18

19 19

