It’s All About the Data Distributed Cloud Computing

Pat Hoeffel, ISS/Polaris December 5, 2016

Distributed Cloud Computing

Cloud Computing
“Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing.” – Amazon
Cloud computing, often referred to as simply “the cloud,” is the delivery of on-demand computing resources—everything from applications to data centers—over the Internet on a pay-for-use basis. – IBM
Cloud computing, also on-demand computing, is a kind of Internet-based computing that provides shared processing resources and data to computers and other devices on demand. – Wikipedia

Cloud Computing
Cloud Computing is a service layer on top of a software-defined, virtualized, elastic, pay-as-you-go Infrastructure-as-a-Service offering that is enabled by, and wholly dependent on, the principles and technologies that comprise distributed systems. – Pat Hoeffel

Top Players in Cloud Services

Example 1: SolrCloud (demo)
Show the ISS PCF instance with 50 shards across 2 nodes. Scale to 4 nodes and rebalance. Then come back to it later.

It’s All About that …
Distributed Computing did not happen because of processing constraints…
Distributed Computing happened because of storage constraints…

It’s All About that DATA

Data is growing faster than Moore’s Law
The growth in data size is outpacing the growth in computing power (Moore’s Law)
The growth in data size is accelerating
There is too much data to be managed or interpreted by humans
Machines are now required to make meaningful use of the data (that they created, like logs, for example)
Ray Kurzweil’s Law of Accelerating Returns

It’s All About The DATA

Data Example: Netflix Facts and Numbers
1/3 of internet traffic
16,000 VMs in AWS
Using 1900 Hadoop Nodes (S3)
100M+ videos/day
500 Billion new events/day - logged to S3
25 Petabytes of data (as of summer 2015)
Primary tools are Java and Python

Data Example: YouTube Facts and Numbers
Total number of people who use YouTube: 1,300,000,000
300 hours of video are uploaded to YouTube every minute
Almost 5 billion videos are watched on YouTube every single day
Total number of hours of video watched on YouTube each month: 900 million
10,113 YouTube videos have generated over 1 billion views
Average number of mobile YouTube video views per day: 1,000,000,000
The average mobile YouTube session lasts 40 minutes, up more than 50% year-over-year

A Brief History of Data Storage
One machine with one disk, talks to nothing else
One machine with multiple disks, some possibly external
One machine talks to a Server with multiple disks
Server talks to a NAS/SAN with many multiple disks
Machine talks to a forward proxy that fronts for a constellation of virtualized back end servers that all talk to and coordinate with each other to store and retrieve data.

Backup and Disaster Recovery Strategy ** Required **

The Network *IS* the Strategy
Data Distribution and Replication make backups unnecessary and DR automatic! (Which is good, because the data volume makes backups impractical, if not impossible, and the data velocity means that by the time you complete a restore, the data would be so old (hours/days?) that it would be effectively useless.)

Scaling – “Up” vs “Out”
Take one machine, make it bigger. -> Mainframes, Supercomputers
Take one machine, duplicate it. -> Clusters, Clouds

Fallacies of Distributed Computing
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
Source: Wikipedia

“CAP” Theorem and “ACID”
ACID (Relational Databases): Atomicity, Consistency, Isolation, Durability
vs.
CAP (Distributed Databases, Eric Brewer): Consistency, Availability, Partition Tolerance
**WARNING** - Gross generalizations! Read the details for yourself, please!

More on CAP
RDBMS: Oracle, SQL Server, Postgres, MySQL, …
Solr, MongoDB, BigTable, HBase, Redis, MemcacheDB, Cassandra, CouchDB, Riak, Dynamo, Voldemort

Eventual Consistency
In a distributed environment, it will take “greater than zero” time to propagate a write across all replicas. This is called the inconsistency window. (Werner Vogels)
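The inconsistency window can be made concrete with a small simulation. This is only a sketch under stated assumptions: the Replica class, the per-replica delay values, and the logical "tick" clock are all hypothetical and invented for illustration. They do not model any particular database.

```python
class Replica:
    """One copy of a value; a write becomes visible only after `delay` ticks."""

    def __init__(self, delay, value="v1"):
        self.delay = delay      # hypothetical propagation delay, in logical ticks
        self.value = value      # what a read currently returns
        self.inbox = []         # writes still in flight: (arrival_tick, value)

    def receive(self, sent_at, value):
        # The write is queued and arrives `delay` ticks after it was sent.
        self.inbox.append((sent_at + self.delay, value))

    def read(self, now):
        # Apply every write that has arrived by `now`, then return the value.
        arrived = sorted(w for w in self.inbox if w[0] <= now)
        if arrived:
            self.value = arrived[-1][1]
            self.inbox = [w for w in self.inbox if w[0] > now]
        return self.value


def inconsistency_window(replicas, value, write_tick=0):
    """Ticks from the write until every replica returns the new value."""
    for replica in replicas:
        replica.receive(write_tick, value)
    tick = write_tick
    while any(replica.read(tick) != value for replica in replicas):
        tick += 1
    return tick - write_tick


replicas = [Replica(delay=0), Replica(delay=2), Replica(delay=5)]
print(inconsistency_window(replicas, "v2"))  # prints 5
```

In this toy model the window equals the slowest replica's delay. A read sent to that replica before tick 5 still returns the old value, which is exactly the stale read that eventual consistency permits.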

Eventual Consistency

Hadoop largely started it
Doug Cutting and Mike Cafarella launch project Nutch to handle billions of searches and index millions of web pages
Oct: Google releases the GFS (Google File System) paper
Dec: Google releases the MapReduce paper
Nutch used GFS and MapReduce to perform operations
Yahoo! created Hadoop based on GFS and MapReduce (with Doug Cutting and team)
Yahoo! started using Hadoop on a 1000-node cluster
Jan: Apache took over Hadoop
Jul: Tested a 4000-node cluster with Hadoop successfully
Hadoop successfully sorted a petabyte of data in less than 17 hours
Dec: Hadoop releases version 1.0
Aug: Version is available

Distributed Data Tools
Apache Software Foundation

What We’ll Talk About Today
ZooKeeper for Service Coordination
SolrCloud for Scalable Search

Introduction to Apache ZooKeeper
Slides shamelessly borrowed from Saurav Haloi

Distributed Computing with ZooKeeper
Hadoop
Spark
Lucidworks Fusion
Cloudera
Pivotal Cloud Foundry

ZooKeeper Demo inside SolrCloud

Are those the only Options?
There are many options. Here are a few.
Gossip as used by Cassandra
Zen Discovery as used by Elastic
Any hardened, shared-memory, distributed data store
Custom, home-grown HTTP-based protocol
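To give a feel for the gossip option, here is a generic push-gossip sketch. It is not Cassandra's actual protocol, which exchanges versioned state digests between nodes. The function name gossip_rounds, the fixed seed, and the rule that each informed node contacts exactly one random peer per round are assumptions made for illustration.

```python
import random


def gossip_rounds(node_count, seed=0, max_rounds=1000):
    """Spread one piece of news from node 0 by push gossip.

    Each round, every node that already knows the news tells one
    randomly chosen node. Returns (rounds_used, informed_nodes).
    """
    rng = random.Random(seed)   # seeded so the run is repeatable
    informed = {0}              # node 0 starts out knowing the news
    rounds = 0
    while len(informed) < node_count and rounds < max_rounds:
        newly_informed = set()
        for node in informed:
            peer = rng.randrange(node_count)
            if peer != node:    # telling yourself spreads nothing
                newly_informed.add(peer)
        informed |= newly_informed
        rounds += 1
    return rounds, informed


rounds, informed = gossip_rounds(16)
print(f"{len(informed)} nodes informed after {rounds} rounds")
```

Because the informed set can at most double each round, 16 nodes need at least 4 rounds. No central coordinator is involved, which is the main contrast with a ZooKeeper-style coordination service.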

What is SolrCloud?
A few detailed slides on SolrCloud, borrowed from Tim Potter, Solr committer.

SolrCloud

Back to the Demo
Solr Admin Console on PCF
PCF Ops Manager
KLA Dashboard on PCF
Amazon EC2 Console

Solr 6.1+ – Key Features
Parallel SQL
Streaming Expressions
JDBC for Solr
Cross Data Center Replication (CDCR)
Graph Traversal (can return GraphML)
GeoJSON Response Writer
Logistic Regression (Machine Learning 6.2+ via the classify function)
Baked-in Alerting capability

Solr Streaming API
daemon(id="uniqueId",
       runInterval="1000",
       update(destinationCollection,
              batchSize=100,
              topic(checkpointCollection,
                    topicCollection,
                    q="topic query",
                    fl="id, title, abstract, text",
                    id="topicId")))

Solr Streaming API - Sources
Stream Sources: search, jdbc, facet, features, gatherNodes, model, random, shortestPath, stats, train, topic

Solr Streaming API - Decorators
Stream Decorators: classify, commit, complement, daemon, executor, fetch, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, rollup, scoreNodes, select, sort, top, unique, update

Example
parallel(workerCollection,
         workers=10,
         sort="DaemonOp desc",
         daemon(id="myDaemon",
                terminate="true",
                update(destinationCollection,
                       batchSize=300,
                       select(from as from_s,
                              to as to_s,
                              body as body_t,
                              topic(checkpointCollection,
                                    dataCollection,
                                    q="*:*",
                                    fl="id, from, to, body",
                                    id="myTopic",
                                    rows="300",
                                    initialCheckpoint="0",
                                    partitionKeys="id")))))
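A streaming expression like the one above is submitted to Solr over HTTP, as the expr parameter of a collection's /stream handler. The sketch below shows one way to do that from Python. The base URL, the collection name, and the helper names stream_url and run_expression are assumptions for illustration, and the "result-set"/"docs" response layout should be checked against the Solr version in use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def stream_url(base_url, collection, expr):
    """Build the /stream request URL, URL-encoding the expression text."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


def run_expression(base_url, collection, expr):
    """Send the expression to Solr and return the tuples it emits.

    Needs a reachable Solr instance; assumes the JSON reply holds the
    tuples under result-set -> docs.
    """
    with urlopen(stream_url(base_url, collection, expr)) as response:
        return json.load(response)["result-set"]["docs"]


# Hypothetical host and collection names, used only to show the URL shape.
expr = 'search(dataCollection, q="*:*", fl="id", sort="id asc")'
print(stream_url("http://localhost:8983/solr", "dataCollection", expr))
```

Only stream_url runs without a server. run_expression is left uncalled here because its result depends on a live cluster and its data.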

So what do we do with this?

Questions?

Backup

SolrCloud

The Virtualized Hardware Stack
Where’s the disk?
Physical Machine
Virtual Machine
Container (Docker)
Service
Yes
It’s in the Cloud!
The disk is now everywhere, and for all intents and purposes, we no longer really care anymore!
