Published by Ann Charles. Modified over 6 years ago.
1
It’s All About the Data: Distributed Cloud Computing
Pat Hoeffel, ISS/Polaris, December 5, 2016
2
Distributed Cloud Computing
3
Cloud Computing
“Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing.” – Amazon
“Cloud computing, often referred to as simply “the cloud,” is the delivery of on-demand computing resources—everything from applications to data centers—over the Internet on a pay-for-use basis.” – IBM
“Cloud computing, also on-demand computing, is a kind of Internet-based computing that provides shared processing resources and data to computers and other devices on demand.” – Wikipedia
4
Cloud Computing Cloud Computing is a service layer on top of a software-defined, virtualized, elastic, pay-as-you-go Infrastructure-as-a-Service offering that is enabled by, and wholly dependent on, the principles and technologies that comprise distributed systems. - Pat Hoeffel
5
Top Players in Cloud Services
6
Example 1: SolrCloud (demo)
Show the ISS PCF instance with 50 shards across 2 nodes. Scale to 4 nodes and rebalance. Then come back to it later.
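The demo's "scale and rebalance" step can be sketched in a few lines. This is only a toy model: it assumes simple round-robin placement and hypothetical node names (`node1` to `node4`), and it is not how SolrCloud itself decides where shards and replicas live. It shows the idea that growing from 2 to 4 nodes only requires moving the shards that overload the original nodes.

```python
from collections import Counter


def assign_shards(num_shards, nodes):
    """Round-robin placement: shard i goes to nodes[i % len(nodes)]."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


def rebalance(placement, nodes):
    """Move shards off overloaded nodes until every node holds an even share."""
    base, extra = divmod(len(placement), len(nodes))
    # Target load per node: the first `extra` nodes take one more shard.
    target = {node: base + (1 if i < extra else 0) for i, node in enumerate(nodes)}
    load = Counter(placement.values())
    new_placement = dict(placement)
    moves = []
    for shard, owner in sorted(placement.items()):
        if load[owner] > target.get(owner, 0):
            # Pick the first node that is still below its target.
            dest = next(n for n in nodes if load[n] < target[n])
            new_placement[shard] = dest
            load[owner] -= 1
            load[dest] += 1
            moves.append((shard, owner, dest))
    return new_placement, moves


# Hypothetical node names; the slide only says "2 nodes" and "4 nodes".
two_nodes = ["node1", "node2"]
four_nodes = ["node1", "node2", "node3", "node4"]

before = assign_shards(50, two_nodes)
after, moves = rebalance(before, four_nodes)

print("before:", dict(Counter(before.values())))
print("after: ", dict(Counter(after.values())))
print(len(moves), "shards moved")
```

In this model 24 of the 50 shards move and the other 26 stay where they are, which is the appeal of rebalancing over redistributing everything from scratch.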
7
It’s All About that …
Distributed Computing did not happen because of processing constraints…
8
It’s All About that …
Distributed Computing happened because of storage constraints…
9
It’s All About that DATA
Distributed Computing happened because of storage constraints…
14
Data is growing faster than Moore’s Law
The growth in data size is outpacing the growth in computing power (Moore’s Law).
The growth in data size is accelerating.
There is too much data to be managed or interpreted by humans.
Machines are now required to make meaningful use of the data (that they created, like logs, for example).
Ray Kurzweil’s Law of Accelerating Returns
15
It’s All About The DATA
16
Data Example: Netflix Facts and Numbers
1/3 of internet traffic
16,000 VMs in AWS
Using 1900 Hadoop Nodes (S3)
100M+ videos/day
500 Billion new events/day - logged to S3
25 Petabytes of data (as of summer 2015)
Primary tools are Java and Python
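To put the event volume in perspective, the daily figure can be converted to an average per-second rate. This is plain arithmetic on the slide's number and assumes the load is spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 500_000_000_000  # "500 Billion new events/day" from the slide
events_per_second = events_per_day / SECONDS_PER_DAY

print(f"{events_per_second:,.0f} events per second on average")
```

That works out to roughly 5.8 million events every second, around the clock.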
17
Data Example: YouTube Facts and Numbers
Total number of people who use YouTube: 1,300,000,000
300 hours of video are uploaded to YouTube every minute!
Almost 5 billion videos are watched on YouTube every single day.
Total number of hours of video watched on YouTube each month: 900 million
10,113 YouTube videos generated over 1 billion views.
Average number of mobile YouTube video views per day: 1,000,000,000
The average mobile YouTube session lasts 40 minutes, up more than 50% year-over-year.
18
23
A Brief History of Data Storage
One machine with one disk, talks to nothing else
One machine with multiple disks, some possibly external
One machine talks to a server with multiple disks
Server talks to a NAS/SAN with many disks
Machine talks to a forward proxy that fronts for a constellation of virtualized back-end servers that all talk to and coordinate with each other to store and retrieve data
24
A Brief History of Data Storage
Backup and Disaster Recovery Strategy: ** Required **
25
A Brief History of Data Storage
The Network *IS* the Strategy
Data Distribution and Replication make backups unnecessary and DR automatic!
(Which is good, because the data volume makes backups impractical, if not impossible, and the data velocity means that by the time you complete a restore, the data would be so old (hours/days?) that it would be effectively useless.)
26
Scaling – “Up” vs “Out”
Scale up: take one machine, make it bigger. -> Mainframes, Supercomputers
Scale out: take one machine, duplicate it. -> Clusters, Clouds
27
Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Source: Wikipedia
28
“CAP” Theorem and “ACID”
ACID (Relational Databases): Atomicity, Consistency, Isolation, Durability
vs.
CAP (Distributed Databases, Eric Brewer): Consistency, Availability, Partition Tolerance
**WARNING** - Gross generalizations! Read the details for yourself, please!
29
More on CAP
30
More on CAP
RDBMS: Oracle, SQL Server, Postgres, MySQL, …
Solr, MongoDB, BigTable, HBase, Redis, MemcacheDB, Cassandra, CouchDB, Riak, Dynamo, Voldemort
31
Eventual Consistency
In a distributed environment, it will take “greater than zero” time to propagate a write across all replicas. This is called the inconsistency window. (Werner Vogels)
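The inconsistency window can be made concrete with a toy, in-process model. This is a minimal sketch under stated assumptions: the `Replica` and `ToyCluster` classes are hypothetical, time is counted in logical ticks rather than milliseconds, and a write reaches the non-primary replicas after a fixed lag. No real database replicates this simply.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyCluster:
    """Writes land on one replica at once and reach the others after `lag` ticks."""

    def __init__(self, names, lag):
        self.replicas = [Replica(n) for n in names]
        self.lag = lag
        self.clock = 0
        self.pending = []  # (deliver_at_tick, replica, key, value)

    def write(self, key, value):
        primary, *others = self.replicas
        primary.data[key] = value  # visible on the primary immediately
        for replica in others:
            self.pending.append((self.clock + self.lag, replica, key, value))

    def tick(self):
        """Advance logical time by one tick and deliver any writes now due."""
        self.clock += 1
        due = [p for p in self.pending if p[0] <= self.clock]
        self.pending = [p for p in self.pending if p[0] > self.clock]
        for _, replica, key, value in due:
            replica.data[key] = value

    def read_all(self, key):
        return [r.data.get(key) for r in self.replicas]

    def consistent(self, key):
        return len(set(self.read_all(key))) == 1


cluster = ToyCluster(["a", "b", "c"], lag=2)
cluster.write("title", "v1")

window = 0
while not cluster.consistent("title"):
    window += 1
    cluster.tick()

print("inconsistency window:", window, "ticks")
```

During the window, a reader that happens to hit replica `b` or `c` sees no value while a reader on `a` sees `v1`; once the window closes, all three agree.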
32
Eventual Consistency
33
Hadoop largely started it
Doug Cutting and Mike Cafarella launch project Nutch to handle billions of searches and indexing millions of web pages.
Oct: Google releases papers with GFS (Google File System).
Dec: Google releases papers with MapReduce.
Nutch used GFS and MapReduce to perform operations.
Yahoo! created Hadoop based on GFS and MapReduce (with Doug Cutting and team).
Yahoo! started using Hadoop on a 1000-node cluster.
Jan: Apache took over Hadoop.
Jul: Tested a 4000-node cluster with Hadoop successfully.
Hadoop successfully sorted a petabyte of data in less than 17 hours to handle billions of searches and indexing millions of web pages.
Dec: Hadoop releases version 1.0.
Aug: Version is available.
34
Distributed Data Tools
35
Distributed Data Tools
Apache Software Foundation
36
What We’ll Talk About Today
ZooKeeper for Service Coordination
SolrCloud for Scalable Search
37
Introduction to Apache ZooKeeper
Slides shamelessly borrowed from Saurav Haloi
38
Distributed Computing with Zookeeper
41
Distributed Computing with Zookeeper
Hadoop, Spark, Lucidworks Fusion, Cloudera, Pivotal Cloud Foundry
42
Zookeeper Demo inside SolrCloud
43
Are those the only options?
There are many options. Here are a few:
Gossip, as used by Cassandra
Zen Discovery, as used by Elasticsearch
Any hardened, shared-memory, distributed data store
Custom, home-grown HTTP-based protocol
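As an illustration of the gossip option, here is a generic push-style gossip simulation. It is a sketch of the general technique only, not Cassandra's actual protocol: the `gossip_rounds` function, the node count, and the seed are all assumptions made for the example.

```python
import random


def gossip_rounds(num_nodes, seed=0):
    """Push-style gossip: every informed node tells one random peer per round."""
    rng = random.Random(seed)
    informed = {0}  # node 0 learns the update first
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        # Snapshot the senders so nodes informed this round start sending next round.
        for sender in list(informed):
            peer = rng.randrange(num_nodes)
            if peer != sender:
                informed.add(peer)
    return rounds, informed


rounds, informed = gossip_rounds(32)
print(f"all {len(informed)} nodes informed after {rounds} rounds")
```

Because the informed set can at most double each round, spreading an update to 32 nodes takes at least 5 rounds, and no node ever needs a central coordinator to learn about it.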
44
What is SolrCloud?
A few detailed slides on SolrCloud, borrowed from Tim Potter, Solr committer.
45
SolrCloud
52
Back to the Demo
Solr Admin Console on PCF
PCF Ops Manager
KLA Dashboard on PCF
Amazon EC2 Console
53
Solr 6.1+ – Key Features
Parallel SQL
Streaming Expressions
JDBC for Solr
Cross Data Center Replication (CDCR)
Graph Traversal (can return GraphML)
GeoJSON Response Writer
Logistic Regression (Machine Learning, 6.2+, via the classify function)
Baked-in Alerting capability
54
Solr Streaming API
daemon(id="uniqueId", runInterval="1000",
  update(destinationCollection, batchSize=100,
    topic(checkpointCollection, topicCollection,
      q="topic query",
      fl="id, title, abstract, text",
      id="topicId")))
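Streaming expressions like the one above are submitted to a collection's `/stream` request handler as an `expr` parameter. The sketch below builds such a request with the Python standard library without sending it. The host, port, and the choice of `topicCollection` as the target are assumptions for illustration, and `build_stream_request` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: adjust host, port, and collection for a real cluster.
SOLR_BASE_URL = "http://localhost:8983/solr"
COLLECTION = "topicCollection"

# The daemon expression from the slide, unchanged apart from line breaks.
EXPR = """
daemon(id="uniqueId", runInterval="1000",
  update(destinationCollection, batchSize=100,
    topic(checkpointCollection, topicCollection,
          q="topic query", fl="id, title, abstract, text", id="topicId")))
"""


def build_stream_request(base_url, collection, expr):
    """Build (but do not send) a POST to the collection's /stream handler."""
    # Collapse the layout whitespace so the expression travels as one line.
    body = urlencode({"expr": " ".join(expr.split())}).encode("utf-8")
    return Request(
        f"{base_url}/{collection}/stream",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_stream_request(SOLR_BASE_URL, COLLECTION, EXPR)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would submit it to a running Solr node; that step is left out here so the example runs without a cluster.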
55
Solr Streaming API - Sources
Stream Sources: search, jdbc, facet, features, gatherNodes, model, random, shortestPath, stats, train, topic
56
Solr Streaming API - Decorators
Stream Decorators: classify, commit, complement, daemon, executor, fetch, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, rollup, scoreNodes, select, sort, top, unique, update
57
Example
parallel(workerCollection, workers=10, sort="DaemonOp desc",
  daemon(id="myDaemon", terminate="true",
    update(destinationCollection, batchSize=300,
      select(from as from_s, to as to_s, body as body_t,
        topic(checkpointCollection, dataCollection,
          q="*:*", fl="id, from, to, body", id="myTopic",
          rows="300", initialCheckpoint="0", partitionKeys="id")))))
58
So what do we do with this?
59
Questions?
60
Backup
61
SolrCloud
63
The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Service
68
The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Service
Yes. It’s in the Cloud!
The disk is now everywhere, and for all intents and purposes, we no longer really care!