It’s All About the Data: Distributed Cloud Computing

1 It’s All About the Data: Distributed Cloud Computing
Pat Hoeffel, ISS/Polaris, December 5, 2016

2 Distributed Cloud Computing

3 Cloud Computing
“Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing.” – Amazon
“Cloud computing, often referred to as simply ‘the cloud,’ is the delivery of on-demand computing resources – everything from applications to data centers – over the Internet on a pay-for-use basis.” – IBM
“Cloud computing, also on-demand computing, is a kind of Internet-based computing that provides shared processing resources and data to computers and other devices on demand.” – Wikipedia

4 Cloud Computing
“Cloud Computing is a service layer on top of a software-defined, virtualized, elastic, pay-as-you-go Infrastructure-as-a-Service offering that is enabled by, and wholly dependent on, the principles and technologies that comprise distributed systems.” – Pat Hoeffel

5 Top Players in Cloud Services
Source

6 Example 1: SolrCloud (demo)
Show the ISS PCF instance with 50 shards across 2 nodes. Scale to 4 nodes and rebalance. Then come back to it later.
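
The rebalance step in the demo can be illustrated with a small Python sketch. This is not SolrCloud's own placement logic; it assumes hypothetical node names and a simple policy of moving one shard at a time from the most loaded node to the least loaded node until the loads differ by at most one.

```python
from collections import Counter


def assign_round_robin(num_shards, nodes):
    """Place shard i on nodes[i % len(nodes)] (illustrative policy only)."""
    return {f"shard{i}": nodes[i % len(nodes)] for i in range(1, num_shards + 1)}


def rebalance(placement, nodes):
    """Move shards from the busiest node to the idlest node until the
    per-node shard counts differ by at most one.

    Every node that already holds a shard must appear in `nodes`.
    Returns the new placement and the list of (shard, source, target) moves.
    """
    by_node = {node: [] for node in nodes}
    for shard, node in placement.items():
        by_node[node].append(shard)

    moves = []
    while True:
        busiest = max(nodes, key=lambda n: len(by_node[n]))
        idlest = min(nodes, key=lambda n: len(by_node[n]))
        if len(by_node[busiest]) - len(by_node[idlest]) <= 1:
            break
        shard = by_node[busiest].pop()
        by_node[idlest].append(shard)
        moves.append((shard, busiest, idlest))

    new_placement = {s: n for n, shards in by_node.items() for s in shards}
    return new_placement, moves


# 50 shards on 2 nodes, then scale out to 4 nodes (hypothetical node names).
placement = assign_round_robin(50, ["node1", "node2"])
placement, moves = rebalance(placement, ["node1", "node2", "node3", "node4"])
loads = Counter(placement.values())
print(dict(loads), "moves:", len(moves))
```

Only the shards that have to move are touched, which is the point of rebalancing after a scale-out: the new nodes fill up while the existing nodes keep most of their data in place.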

7 It’s All About that …
Distributed Computing did not happen because of processing constraints…

8 It’s All About that …
Distributed Computing happened because of storage constraints…

9 It’s All About that DATA
Distributed Computing happened because of storage constraints…

13 It’s All About that DATA

14 Data is growing faster than Moore’s Law
The growth in data size is outpacing the growth in computing power (Moore’s Law).
The growth in data size is accelerating.
There is too much data to be managed or interpreted by humans.
Machines are now required to make meaningful use of the data (that they created, like logs, for example).
Ray Kurzweil’s Law of Accelerating Returns

15 It’s All About The DATA

16 Data Example: Netflix
Facts and Numbers
1/3 of internet traffic
16,000 VMs in AWS
Using 1900 Hadoop Nodes (S3)
100M+ videos/day
500 Billion new events/day - logged to S3
25 Petabytes of data (as of summer 2015)
Primary tools are Java and Python

17 Data Example: YouTube
Facts and Numbers
Total number of people who use YouTube: 1,300,000,000
300 hours of video are uploaded to YouTube every minute!
Almost 5 billion videos are watched on YouTube every single day.
Total number of hours of video watched on YouTube each month: 900 million
10,113 YouTube videos have generated over 1 billion views.
Average number of mobile YouTube video views per day: 1,000,000,000
The average mobile YouTube session lasts 40 minutes, up more than 50% year-over-year.

23 A Brief History of Data Storage
One machine with one disk, talks to nothing else
One machine with multiple disks, some possibly external
One machine talks to a server with multiple disks
Server talks to a NAS/SAN with many disks
Machine talks to a reverse proxy that fronts for a constellation of virtualized back-end servers that all talk to and coordinate with each other to store and retrieve data

24 A Brief History of Data Storage
Backup and Disaster Recovery Strategy ** Required **
One machine with one disk, talks to nothing else
One machine with multiple disks, some possibly external
One machine talks to a server with multiple disks
Server talks to a NAS/SAN with many disks
Machine talks to a reverse proxy that fronts for a constellation of virtualized back-end servers that all talk to and coordinate with each other to store and retrieve data

25 A Brief History of Data Storage
The Network *IS* the Strategy
Data Distribution and Replication make backups unnecessary and DR automatic! (Which is good, because the data volume makes backups impractical, if not impossible, and the data velocity means that by the time you complete a restore, the data would be so old (hours/days?) that it would be effectively useless.)
One machine with one disk, talks to nothing else
One machine with multiple disks, some possibly external
One machine talks to a server with multiple disks
Server talks to a NAS/SAN with many disks
Machine talks to a reverse proxy that fronts for a constellation of virtualized back-end servers that all talk to and coordinate with each other to store and retrieve data

26 Scaling – “Up” vs “Out”
Up: Take one machine, make it bigger. -> Mainframes, Supercomputers
Out: Take one machine, duplicate it. -> Clusters, Clouds

27 Fallacies of Distributed Computing
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
Source: Wikipedia

28 “CAP” Theorem and “ACID”
ACID (Relational Databases): Atomicity, Consistency, Isolation, Durability
vs.
CAP (Distributed Databases, Eric Brewer): Consistency, Availability, Partition Tolerance
**WARNING** - Gross generalizations! Read the details for yourself, please!

29 More on CAP

30 More on CAP
RDBMS: Oracle, SQL Server, Postgres, MySQL, …
Solr, MongoDB, BigTable, HBase, Redis, MemcacheDB
Cassandra, CouchDB, Riak, Dynamo, Voldemort

31 Eventual Consistency
In a distributed environment, it will take “greater than zero” time to propagate a write across all replicas. This is called the inconsistency window. (Werner Vogels)
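
The inconsistency window can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not any real database's replication protocol: the `Replica` and `Cluster` classes, the tick-based clock, and the per-replica delays are all hypothetical.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0


class Cluster:
    """Toy cluster: writes land on the first replica immediately and reach
    the other replicas after a per-replica delay measured in ticks."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []  # entries: (deliver_at_tick, replica, version, value)
        self.tick = 0

    def write(self, value, delays):
        primary, *others = self.replicas
        version = primary.version + 1
        primary.value, primary.version = value, version
        for replica, delay in zip(others, delays):
            self.pending.append((self.tick + delay, replica, version, value))

    def advance(self):
        """Move the clock forward one tick and deliver any due updates."""
        self.tick += 1
        still_pending = []
        for deliver_at, replica, version, value in self.pending:
            if deliver_at <= self.tick:
                if version > replica.version:  # never overwrite newer data
                    replica.value, replica.version = value, version
            else:
                still_pending.append((deliver_at, replica, version, value))
        self.pending = still_pending

    def read_all(self):
        return [r.value for r in self.replicas]

    def consistent(self):
        return len(set(self.read_all())) == 1


cluster = Cluster(["a", "b", "c"])
cluster.write("v1", delays=[1, 3])  # replica b lags 1 tick, replica c lags 3
stale_view = cluster.read_all()     # readers can see different answers here

window = 0
while not cluster.consistent():
    cluster.advance()
    window += 1
print("stale view:", stale_view, "inconsistency window (ticks):", window)
```

During the window a read served by replica `b` or `c` returns the old value, which is exactly the behavior an eventually consistent store asks its clients to tolerate.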

32 Eventual Consistency

33 Hadoop largely started it
The Nutch project (Doug Cutting and Mike Cafarella) launches to handle billions of searches and index millions of web pages.
Oct: Google releases its paper on GFS (Google File System)
Dec: Google releases its paper on MapReduce
Nutch used GFS and MapReduce to perform operations
Yahoo! created Hadoop based on GFS and MapReduce (with Doug Cutting and team)
Yahoo! started using Hadoop on a 1000 node cluster
Jan: Apache took over Hadoop
Jul: Tested a 4000 node cluster with Hadoop successfully
Hadoop successfully sorted a petabyte of data in less than 17 hours to handle billions of searches and index millions of web pages.
Dec: Hadoop releases version 1.0
Aug: Version is available

34 Distributed Data Tools

35 Distributed Data Tools
Apache Software Foundation

36 What We’ll Talk About Today
ZooKeeper for Service Coordination
SolrCloud for Scalable Search

37 Introduction to Apache ZooKeeper
Slides shamelessly borrowed from Saurav Haloi

38 Distributed Computing with Zookeeper

39 Distributed Computing with Zookeeper

40 Distributed Computing with Zookeeper

41 Distributed Computing with Zookeeper
Hadoop, Spark, Lucidworks Fusion, Cloudera, Pivotal Cloud Foundry

42 Zookeeper Demo inside SolrCloud

43 Are those the only Options?
There are many options. Here are a few.
Gossip as used by Cassandra
Zen Discovery as used by Elasticsearch
Any hardened, shared-memory, distributed data store
Custom, home-grown HTTP-based protocol
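
To give a feel for the gossip option, here is a minimal Python sketch of push-style gossip. It is not Cassandra's actual protocol; the function name, the fanout parameter, and the fixed random seed are assumptions made for illustration. Each round, every node that already knows a piece of state tells `fanout` randomly chosen nodes.

```python
import random


def gossip_rounds(num_nodes, fanout=1, seed=42):
    """Count the rounds until every node has heard a rumor that starts at node 0.

    Each round, every informed node picks `fanout` nodes at random (it may
    pick itself or an already informed node) and shares the rumor with them.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same count
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for _ in informed:
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


for size in (10, 100, 1000):
    print(size, "nodes ->", gossip_rounds(size), "rounds")
```

No coordinator is involved, and because the informed set can at most double per round with a fanout of 1, the number of rounds grows roughly with the logarithm of the cluster size rather than with the size itself.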

44 What is SolrCloud?
A few detailed slides on SolrCloud, borrowed from Tim Potter, Solr committer.

45 SolrCloud

46 SolrCloud

47 SolrCloud

48 SolrCloud

49 SolrCloud

50 SolrCloud

51 SolrCloud

52 Back to the Demo
Solr Admin Console on PCF
PCF Ops Manager
KLA Dashboard on PCF
Amazon EC2 Console

53 Solr 6.1+ – Key Features
Parallel SQL
JDBC for Solr
Streaming Expressions
Cross Data Center Replication (CDCR)
Graph Traversal (can return GraphML)
GeoJSON Response Writer
Logistic Regression (Machine Learning 6.2+ via the classify function)
Baked-in Alerting capability

54 Solr Streaming API
daemon(id="uniqueId",
       runInterval="1000",
       update(destinationCollection,
              batchSize=100,
              topic(checkpointCollection,
                    topicCollection,
                    q="topic query",
                    fl="id, title, abstract, text",
                    id="topicId")
       )
)
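
A streaming expression like the daemon on this slide is submitted to a collection's `/stream` request handler in the `expr` parameter. The Python sketch below only builds that request URL; the host, port, and the choice of `destinationCollection` as the target collection are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# The daemon expression from the slide, written as a single string.
DAEMON_EXPR = (
    'daemon(id="uniqueId", runInterval="1000", '
    'update(destinationCollection, batchSize=100, '
    'topic(checkpointCollection, topicCollection, '
    'q="topic query", fl="id, title, abstract, text", id="topicId")))'
)


def stream_url(solr_base, collection, expr):
    """Return the /stream handler URL with the expression URL-encoded in `expr`."""
    return f"{solr_base}/{collection}/stream?{urlencode({'expr': expr})}"


# Hypothetical local Solr instance; adjust the base URL and collection as needed.
url = stream_url("http://localhost:8983/solr", "destinationCollection", DAEMON_EXPR)
print(url)
```

Fetching the printed URL against a running SolrCloud cluster (for example with `curl`) would start the daemon, which then re-runs the wrapped `update(topic(...))` pipeline at the interval given by `runInterval`.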

55 Solr Streaming API - Sources
Stream Sources: search, jdbc, facet, features, gatherNodes, model, random, shortestPath, stats, train, topic

56 Solr Streaming API - Decorators
Stream Decorators: classify, commit, complement, daemon, executor, fetch, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, rollup, scoreNodes, select, sort, top, unique, update

57 Example
parallel(workerCollection,
         workers=10,
         sort="DaemonOp desc",
         daemon(id="myDaemon",
                terminate="true",
                update(destinationCollection,
                       batchSize=300,
                       select(from as from_s,
                              to as to_s,
                              body as body_t,
                              topic(checkpointCollection,
                                    dataCollection,
                                    q="*:*",
                                    fl="id, from, to, body",
                                    id="myTopic",
                                    rows="300",
                                    initialCheckpoint="0",
                                    partitionKeys="id")))))

58 So what do we do with this?

59 Questions?

60 Backup

61 SolrCloud

62 SolrCloud

63 The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Service

64 The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Yes
Service

68 The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Yes
It’s in the Cloud!
Service
The disk is now everywhere, and for all intents and purposes, we no longer really care!

