Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & others
VP of Incubator at the Apache Software Foundation
Email: tdunning@apache.org, tdunning@maprtech.com
Twitter: @ted_dunning
Why Now? Moore's law has applied for a long time, so why is data exploding now? Why not 10 years ago? Why not 20? We have seen constant growth for a long time, and simple growth would explain only some kinds of companies starting with big data (probably big ones), followed by slow adoption elsewhere. Databases started with big companies and took 20 years or more to reach everywhere, because need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time, so it changed nearly all industries at all scales nearly simultaneously. So why is big data exploding right now, and why is it exploding at all?
Size Matters, but … If it were just availability of data, then existing big companies would have adopted big data technology first. They didn't.
Or Maybe Cost If it were just net positive value, then finance companies should have adopted first, because they have a higher opportunity value per byte. They didn't.
Backwards Adoption Under almost any threshold argument, startups would not have adopted big data technology first. They did.
Everywhere at Once? Something very strange is happening. Big data is being applied at many different data scales, at many value scales, by large companies and small. Why?
Analytics Scaling Laws
Analytics scaling is all about the 80-20 rule: big gains for little initial effort, then rapidly diminishing returns.
The key to net value is how costs scale. Old school: exponential scaling. Big data: linear scaling with a low constant.
Cost/performance has changed radically, IF you can use many commodity boxes.
The different kinds of scaling laws have different shapes, and that shape is the key.
Most data isn’t worth much in isolation Later data is dregs The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase. First data is valuable
Suddenly worth processing But has high aggregate value Later data is dregs The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase. First data is valuable
If we can handle the scale It’s really big The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
So what makes that possible?
In classical analytics, the cost of doing analytics increases sharply with scale.
Net value optimum has a sharp peak well before maximum effort. The result is a net value that has a sharp optimum in the region where value is increasing rapidly and cost is not yet increasing so rapidly.
But scaling laws are changing both slope and shape. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape, and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
More than just a little
They are changing a LOT!
This next sequence shows how the net value changes with linear cost models of different slopes.
Notice how the best net value has jumped up significantly
And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
Initially, linear cost scaling actually makes things worse. Then a tipping point is reached and things change radically …
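The tipping-point argument can be sketched numerically. This is an illustrative model, not from the talk: the curve shapes are assumed, with value growing at diminishing returns, classical cost growing super-linearly, and big-data cost growing linearly. We then look at where net value peaks as the linear slope flattens.

```python
import math

# Illustrative shapes (assumptions, not the talk's actual numbers):
def value(n):
    return 100 * math.sqrt(n)          # quick early gains, diminishing returns

def classical_cost(n):
    return 0.001 * n ** 2              # old school: super-linear scaling

def linear_cost(n, slope):
    return slope * n                   # big data: linear, low constant

def best_scale(cost, max_n=1_000_000, step=100):
    # brute-force search for the data scale with maximum net value
    return max(range(step, max_n, step), key=lambda n: value(n) - cost(n))

print(best_scale(classical_cost))                    # sharp peak at modest scale
print(best_scale(lambda n: linear_cost(n, 0.5)))     # flatter cost: optimum moves out
print(best_scale(lambda n: linear_cost(n, 0.1)))     # near-horizontal: optimum jumps far out
```

As the linear slope drops, the optimum data scale jumps by orders of magnitude, which is the qualitative change the slides describe.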
Pre-requisites for Tipping
To reach the tipping point, algorithms must scale out horizontally, on commodity hardware that can and will fail. Data practice must change: denormalized is the new black, flexible data dictionaries are the rule, structured data becomes rare.
Inferentially Forbidden Practices
Old data should not be changed; schemas must be flexible (note Apache Drill). Global state isn't, and should now be discarded in favor of before and after (see various large-scale databases, such as Spanner and MapR DB). All processes become streams (more about this coming up). Scale is nearly inevitable.
What System Architecture?
Development Speed ≈ V – S – C
where V = total developer volume, S = internal communication, C = coupling
Communication cost (S + C) grows with team size.
Which System Will Be Done Soonest?
Which System Will Be Done Soonest? V – S – C V – S – C V – S – C
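The "which system will be done soonest" question can be made concrete with a toy model. The functional forms and constants here are my assumptions for illustration, not the talk's: V grows with headcount, S grows roughly quadratically within a team, and C is paid per pair of coupled components, discounted when implementations are hidden behind narrow interfaces.

```python
# Toy model of Development Speed ≈ V - S - C (constants are illustrative)
def speed(team_sizes, coupling_per_pair, hidden_interfaces=True):
    V = sum(team_sizes)                                   # total developer volume
    S = sum(n * (n - 1) / 2 * 0.1 for n in team_sizes)    # within-team chatter
    k = len(team_sizes)
    pairs = k * (k - 1) / 2                               # inter-service links
    # hidden implementations (streams, REST) cut the cost of each coupling
    C = pairs * coupling_per_pair * (0.2 if hidden_interfaces else 1.0)
    return V - S - C

print(speed([24], coupling_per_pair=1.0))          # one big team: chatter swamps V
print(speed([6, 6, 6, 6], coupling_per_pair=1.0))  # four small teams: much faster
```

Under these assumptions the same 24 developers go faster split into small teams with hidden interfaces, which is the argument the next slide draws.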
Micro-services Win, ESB Does Not
As systems grow, there is almost no choice but to adopt something like micro-services. Messaging systems without global transactions (Kafka-esque) have already won for streaming systems.
How Should Arrows Work?
Consider streaming micro-services. Implementations should be hidden to decrease C. Either process could be a batch process; either process (or both) could be running, or not. => Messaging must be persistent.
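Why persistence follows from "either process could be down" can be shown with a minimal sketch of a Kafka-like topic: an append-only log addressed by offsets. The class and method names here are illustrative, not a real client API.

```python
# Minimal sketch of a persistent topic: an append-only log with offsets.
class Topic:
    def __init__(self):
        self.log = []                   # durable, ordered message log

    def send(self, message):
        self.log.append(message)        # producer never waits for consumers
        return len(self.log) - 1        # offset assigned to the message

    def read(self, offset):
        # a consumer resumes from its own saved offset, even if it was
        # offline (or not yet written) when the messages were produced
        return self.log[offset:]

topic = Topic()
topic.send({"pump": 23, "pressure": 80.5})
topic.send({"pump": 23, "pressure": 81.2})

# a consumer attaches later and replays from the beginning
messages = topic.read(0)
print(len(messages))   # 2
```

Because the log persists independently of any consumer, a batch job and a real-time process can read the same arrow at different times without coordinating.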
Will Arrows Be Used?
Programmers adopt REST interfaces easily: they scale well enough and offer universal access. Adoption of streaming lags because of perceived performance and scale issues. => Streaming must be pervasive and performant.
We Already Have Some Winners
Only very few streaming systems can meet the requirements of scaling, persistence, performance, and pervasiveness. Kafka-esque designs are essentially required.
Use Case: Straight Streaming
Financial Services Use Case
Customer handles bids and asks for stocks for off-exchange trading. Need: routing of information to recipients in such a way as to support the core required queries. Core queries: each recipient and sender would like to know what transactions they have received or sent during any period of time, most commonly from a few minutes ago to the present. They also want to be able to show a history of bids and offers for each stock.
Financial Services Use Case
For reference, assume: 1,000 to 10,000 unique senders and receivers; each bid or offer includes 10 recipients on average; bids and offers arrive at a rate of 300k messages/second. The customer tried to get this to work with HBase / HW (and utterly failed). What would you do?
Financial Services: Stream First Solution
Key discussion points: the system handles nearly 4 million inserts per second running on 3 nodes. This design doesn't use a database; is that good or bad? Real-time queries are easily implemented directly against streams. Archiving to compressed column files allows long-term analytics. Aggregating to a DB allows live dashboards.
Extreme streaming pays off big
Use Case: Platform Replication
Basic Situation Multiple locations Each location has many pumps
What Does a Pump Look Like?
Voltage, current; temperature, pressure and flow (two measurement points); winding temperature.
Basic Architecture Reflects Business Structure
One Stream Has Many Topics
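The point of "one stream, many topics" is that the stream, not the topic, is the unit of administration and replication. A minimal sketch (names and mechanics are illustrative, not a real product API):

```python
from collections import defaultdict

class Stream:
    def __init__(self):
        self.topics = defaultdict(list)     # topic name -> append-only log

    def send(self, topic, msg):
        self.topics[topic].append(msg)

    def replicate_to(self, other):
        # replicating the stream carries every topic with it, which is why
        # one stream per location makes multi-site replication simple
        for name, log in self.topics.items():
            other.topics[name].extend(log[len(other.topics[name]):])

site = Stream()                             # e.g. one stream for one location
site.send("pump-23-pressure", 80.5)
site.send("pump-23-flow", 12.0)

hq = Stream()                               # central replica
site.replicate_to(hq)
print(sorted(hq.topics))   # ['pump-23-flow', 'pump-23-pressure']
```

With per-pump, per-metric topics inside one per-location stream, the architecture mirrors the business structure from the previous slide: add a pump and you add topics, add a location and you add a stream.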
Use Case: Massive IoT
Massive IoT
Requirements: 100 million cars; 2 kB/second per car; cars roam between data centers.
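A quick back-of-envelope check (my arithmetic, not stated in the talk) shows why these requirements force the architectural choices above:

```python
cars = 100_000_000           # 100 million cars
rate_per_car = 2_000         # 2 kB/s each, in bytes/s
total = cars * rate_per_car  # aggregate ingest across all data centers
print(total)                 # 200000000000 bytes/s, i.e. 200 GB/s aggregate
```

An aggregate ingest of roughly 200 GB/s, from devices that roam between data centers, is exactly the combination of scale, persistence, and multi-site replication the streaming-first design is meant to handle.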
Conclusions
Scale and speed change core architectural trade-offs. Streaming is the ideal abstraction for much of the micro-services load. All major persistence abstractions must be first-class.
What is Convergence? Files, tables, and streams.
Call to action: Require convergence
Short Books by Ted Dunning & Ellen Friedman Published by O’Reilly in 2014 - 2016 For sale from Amazon or O’Reilly Free e-books currently available courtesy of MapR http://bit.ly/recommendation-ebook http://bit.ly/ebook-anomaly http://bit.ly/mapr-tsdb-ebook http://bit.ly/ebook-real-world-hadoop
Streaming Architecture by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly) Free copies on MapR.com http://bit.ly/mapr-ebook-streams
Thank You!
Q & A Engage with us! @mapr maprtech mapr-technologies MapR tdunning@maprtech.com maprtech