Presentation transcript:

Contact Information: Ted Dunning, Chief Applications Architect at MapR Technologies; Committer & PMC for Apache’s Drill, ZooKeeper & others; VP of Incubator at Apache Foundation. Email: tdunning@apache.org, tdunning@maprtech.com. Twitter: @ted_dunning

Why Now? Moore’s law has applied for a long time, so why is data exploding now? Why not 10 years ago? Why not 20? We have seen constant growth for a long time, and simple growth would only explain some kinds of companies (probably big ones) starting with big data, followed by slow adoption. Databases started with big companies and took 20 years or more to reach everywhere, because the need exceeded the cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time, so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now, and why is it exploding at all?

Size Matters, but … If it were just availability of data, then existing big companies would adopt big data technology first. They didn’t.

Or Maybe Cost: If it were just a net positive value, then finance companies should adopt first because they have higher opportunity value / byte. They didn’t.

Backwards adoption: Under almost any threshold argument, startups would not adopt big data technology first. They did.

Everywhere at Once? Something very strange is happening. Big data is being applied at many different scales, at many value scales, by large companies and small. Why?

Analytics Scaling Laws: Analytics scaling is all about the 80-20 rule: big gains for little initial effort, then rapidly diminishing returns. The key to net value is how costs scale. Old school: exponential scaling. Big data: linear scaling, low constant. Cost/performance has changed radically IF you can use many commodity boxes. The different kinds of scaling laws have different shapes, and I think that shape is the key.

First data is valuable; later data is dregs. Most data isn’t worth much in isolation. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.

But later data has high aggregate value, and it is suddenly worth processing.

It’s really big, if we can handle the scale.

So what makes that possible?

In classical analytics, the cost of doing analytics increases sharply.

Net value optimum has a sharp peak well before maximum effort. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.

But scaling laws are changing both slope and shape. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.

More than just a little

They are changing a LOT!

This next sequence shows how the net value changes with different slope linear cost models.

Notice how the best net value has jumped up significantly

And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
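
The sequence above can be reproduced numerically. The following is only a sketch under assumed curves: the saturating value function, the exponential "classical" cost, and all constants are illustrative choices, not figures from the talk.

```python
import math

def value(n, k=1.0):
    """Assumed value curve: quick early gains, then diminishing returns."""
    return n / (n + k)

def exponential_cost(n, a=0.01, s=1.0):
    """Assumed 'old school' cost model: grows exponentially with scale."""
    return a * (math.exp(n / s) - 1.0)

def linear_cost(n, slope):
    """Assumed scale-out cost model: proportional to scale."""
    return slope * n

def best_scale(cost, n_max=100.0, steps=10_000):
    """Grid-search the scale n in [0, n_max] that maximizes value - cost."""
    grid = [n_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda n: value(n) - cost(n))

# Optimum data scale under the exponential cost model.
classical_optimum = best_scale(exponential_cost)

# Optimum data scale for progressively flatter linear cost lines.
linear_optima = {
    slope: best_scale(lambda n, slope=slope: linear_cost(n, slope))
    for slope in (0.1, 0.01, 0.001)
}
```

With these assumed constants the optimum scale moves outward as the slope drops, and the steepest linear model is no better than the exponential one, which is consistent with linear cost scaling only paying off once the slope is low enough.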

Initially, linear cost scaling actually makes things worse. Then a tipping point is reached and things change radically …

Pre-requisites for Tipping: To reach the tipping point, algorithms must scale out horizontally, on commodity hardware that can and will fail. Data practice must change: denormalized is the new black, flexible data dictionaries are the rule, and structured data becomes rare.

Inferentially Forbidden Practices: Old data should not be changed, and schemas must be flexible (note Apache Drill). Global state isn’t, and “now” should be discarded in favor of “before” and “after” (see various large-scale databases: Spanner, MapR DB). All processes become streams (more about this coming up). Scale is nearly inevitable.

What System Architecture? Development Speed ≈ V – S – C, where V is total developer volume, S is internal communication, and C is coupling.

[Figure: communication cost (S + C) versus team size]

Which System Will Be Done Soonest? V – S – C V – S – C V – S – C

Micro-services Win, ESB Does Not: As systems grow, there is almost no choice but to adopt something like micro-services. Messaging systems without global transactions (Kafka-esque) have already won for streaming systems.
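
To make the V – S – C comparison concrete, here is a toy calculation. It is only a sketch: the assumption that S grows with the number of developer pairs inside a team, the fixed cost per pair of coupled teams, and the constants `s` and `c` are all illustrative and not taken from the talk.

```python
def development_speed(team_sizes, s=0.05, c=0.5):
    """Toy version of speed ≈ V - S - C for a system split into teams.

    V: total developer volume (head count).
    S: internal communication, assumed proportional to the number of
       developer pairs inside each team.
    C: coupling, assumed to cost a fixed amount per pair of teams.
    """
    volume = sum(team_sizes)
    internal = s * sum(n * (n - 1) / 2 for n in team_sizes)
    teams = len(team_sizes)
    coupling = c * teams * (teams - 1) / 2
    return volume - internal - coupling

monolith = development_speed([24])           # one team of 24 developers
services = development_speed([6, 6, 6, 6])   # four loosely coupled teams
```

Under these assumptions the same 24 developers score about 10.2 as one team and 18 as four small teams, because pairwise communication inside a large team grows much faster than the coupling cost between a few small ones.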

How Should Arrows Work? Consider streaming micro-services. Implementations should be hidden to decrease C. Either process could be a batch process. Either process (or both) could be running, or not. => Messaging must be persistent.
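
Why persistence follows from "either process could be running, or not" can be shown with a toy topic. This is a minimal in-memory sketch with hypothetical names (`PersistentTopic`, `publish`, `poll`); a real messaging system would write the log to disk and replicate it.

```python
class PersistentTopic:
    """Toy append-only topic: messages are kept after they are read."""

    def __init__(self):
        self._log = []        # every message ever published, in order
        self._offsets = {}    # consumer name -> next index to read

    def publish(self, message):
        self._log.append(message)

    def poll(self, consumer):
        """Return all messages this consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return batch

topic = PersistentTopic()
topic.publish("reading-1")        # producer runs while no consumer is up
topic.publish("reading-2")
first = topic.poll("batch-job")   # consumer starts later and misses nothing
topic.publish("reading-3")
second = topic.poll("batch-job")  # resumes where it left off
late = topic.poll("dashboard")    # a new consumer still sees everything
```

Because the log outlives each read, the producer never needs to know whether its consumer is a continuously running service or a batch job that wakes up once an hour.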

Will Arrows Be Used? Programmers adopt REST interfaces easily: they scale enough and offer universal access. Adoption of streaming lags because of perceived performance and scale issues. => Streaming must be pervasive and performant.

We Already Have Some Winners: Only very few streaming systems can meet the requirements of scaling, persistence, performance and pervasiveness. Kafka-esque designs are essentially required.

Use Case: Straight streaming

Financial Services Use Case: Customer handles bids and asks for stocks for off-exchange trading. Need: routing of information to recipients in a way that supports the core required queries. Core queries: each recipient and sender would like to know what transactions they have received or sent during any period of time, that period most commonly being from a few minutes ago to the present time. They also want to be able to show a history of bids & offers for each stock.

Financial Services Use Case: For reference, assume: 1000 - 10,000 unique senders and receivers; each bid or offer includes 10 recipients on average; bids and offers arrive at a rate of 300k messages / second. Customer tried to get this to work with HBase / HW (and utterly failed). What would you do?

Financial Services: Stream First Solution

Key discussion points: System handles nearly 4 million inserts running on 3 nodes. This design doesn’t use a database. Is that good or bad? Real-time queries are easily implemented directly against streams. Archiving to compressed column files allows long-term analytics. Aggregates to DB allow live dashboards.
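
The fan-out implied by the requirements can be sketched as follows. The topic layout here (one topic per sender, per recipient and per stock) is an assumption, since the solution diagram is not part of this transcript, and the names `topics_for`, `route` and `query` are hypothetical. Only the 300k messages / second rate and the 10 recipients per message come from the slides.

```python
from collections import defaultdict

def topics_for(message):
    """Assumed layout: one topic per sender, per recipient and per stock."""
    yield "sender/" + message["sender"]
    for recipient in message["recipients"]:
        yield "recipient/" + recipient
    yield "stock/" + message["stock"]

def route(message, streams):
    """Append the message to every topic it belongs to; return write count."""
    writes = 0
    for topic in topics_for(message):
        streams[topic].append(message)
        writes += 1
    return writes

def query(streams, topic, start, end):
    """Core query: messages on one topic with start <= time < end."""
    return [m for m in streams[topic] if start <= m["time"] < end]

streams = defaultdict(list)
bid = {"time": 100, "sender": "s1", "stock": "XYZ", "kind": "bid",
       "recipients": ["r%d" % i for i in range(10)]}

writes_per_message = route(bid, streams)          # 1 + 10 + 1 topics
writes_per_second = 300_000 * writes_per_message  # at the quoted arrival rate
```

Under this assumed layout each message costs 12 writes, or 3.6 million writes per second at the quoted rate, which is the same order of magnitude as the insert figure in the discussion points. Each core query then becomes a time-range scan over a single topic.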

Extreme streaming pays off big

Use Case: Platform replication

Basic Situation: Multiple locations. Each location has many pumps.

What Does a Pump Look Like? [Figure labels: voltage, current, temperature, pressure, flow, temperature, pressure, flow, winding temperature]

Basic Architecture Reflects Business Structure

One Stream Has Many Topics

Use Case: Massive IoT

Massive IoT Requirements: 100 million cars, 2 kB / second. Cars roam between data centers.
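
A quick back-of-the-envelope calculation shows what these requirements add up to. It assumes the 2 kB / second figure is per car and that kB means 1000 bytes; neither is stated explicitly on the slide.

```python
CARS = 100_000_000               # from the slide
BYTES_PER_CAR_PER_SEC = 2_000    # 2 kB/s, assumed per car, decimal kB

# Aggregate ingest rate across the whole fleet.
aggregate_bytes_per_sec = CARS * BYTES_PER_CAR_PER_SEC
aggregate_gb_per_sec = aggregate_bytes_per_sec / 1e9

# Raw volume accumulated over one day (86,400 seconds).
petabytes_per_day = aggregate_bytes_per_sec * 86_400 / 1e15
```

Under those assumptions the fleet produces 200 GB per second in aggregate, or roughly 17 PB of raw data per day, spread across whichever data centers the cars happen to be connected to.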

Conclusions: Scale and speed change core architectural trade-offs. Streaming is an ideal abstraction for much of the micro-services load. All major persistence abstractions must be first-class.

What is Convergence? Files, tables, streams.

Call to action: Require convergence

Short Books by Ted Dunning & Ellen Friedman. Published by O’Reilly in 2014 - 2016. For sale from Amazon or O’Reilly. Free e-books currently available courtesy of MapR: http://bit.ly/recommendation-ebook, http://bit.ly/ebook-anomaly, http://bit.ly/mapr-tsdb-ebook, http://bit.ly/ebook-real-world-hadoop

Streaming Architecture by Ted Dunning and Ellen Friedman, © 2016 (published by O’Reilly). Free copies on MapR.com: http://bit.ly/mapr-ebook-streams

Thank You!

Q & A Engage with us! @mapr maprtech mapr-technologies MapR tdunning@maprtech.com maprtech