Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch
Why Big Data?
because bigger is smarter
  – answer tough questions
because we can
  – push the limits and good things will happen
bigger = smarter?
Yes!
  – tolerate errors (see the sketch below)
  – discover the long tail and corner cases
  – machine learning works much better
But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions
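A minimal sketch, not from the slides, of the "tolerate errors" point: each individual observation is noisy, but an aggregate estimated over more data gets closer to the true value. The click-rate numbers are hypothetical and chosen purely for illustration.

```java
import java.util.Random;

public class BiggerIsSmarter {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double trueClickRate = 0.03;  // hypothetical quantity we want to estimate

        // Estimate the same rate from samples of growing size: each observation
        // is noisy (0 or 1), but the aggregate error shrinks roughly as 1/sqrt(n).
        for (int n : new int[]{100, 10_000, 1_000_000}) {
            int clicks = 0;
            for (int i = 0; i < n; i++) {
                if (rnd.nextDouble() < trueClickRate) clicks++;
            }
            double estimate = (double) clicks / n;
            System.out.printf("n=%,d  estimate=%.5f  |error|=%.5f%n",
                    n, estimate, Math.abs(estimate - trueClickRate));
        }
    }
}
```

The "But!" side of the slide is the flip side of the same effect: with enough data, even spurious patterns reach apparent statistical significance, which is why humans still have to ask the right questions.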
Fundamental Problem of Big Data
There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    e.g., stock market predictions change the behavior of people
    e.g., Web search engines determine the behavior of people
Hard to debug: takes the human out of the loop
  – Example: how to play the lottery in Napoli
    Step 1: You visit “oracles” who predict numbers to play.
    Step 2: You visit “interpreters” who explain the predictions.
    Step 3: After you lost, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
  – [Luciano De Crescenzo: Thus Spake Bellavista]
Why Big Data? (recap)
because bigger is smarter
  – answer tough questions
because we can
  – push the limits and good things will happen
Because we can… Really?
Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel (see the sketch below)
But,
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
I believe that all these “but’s” can be addressed
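To make "counting is embarrassingly parallel" concrete, here is a minimal sketch (plain Java parallel streams rather than Hadoop, and with made-up input): every partition of the data is counted independently, and the partial counts are merged by summing. That structure is exactly what MapReduce exploits at cluster scale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelCount {
    public static void main(String[] args) {
        // Hypothetical input: each string stands in for one partition of a large log.
        List<String> partitions = List.of(
                "error warn info info",
                "info error error",
                "warn info");

        // Each partition is tokenized and counted independently; the partial
        // per-partition counts are merged by summing, with no coordination
        // needed beyond that final merge.
        Map<String, Long> counts = partitions.parallelStream()
                .flatMap(p -> Arrays.stream(p.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts);  // e.g. {error=3, info=4, warn=2}
    }
}
```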
Utility & Cost Functions of Data
[Chart: utility and cost plotted against noise/error, with separate curves for curated, random, and malicious data]

Best Utility/Cost Tradeoff
[Chart: utility and cost against noise/error; “malicious” data is highlighted]

What is good enough?
[Chart: utility and cost against noise/error; “curated” data is highlighted]
What about platforms?
Relational Databases
  – great for 20% of the data
  – not great for 80% of the data
Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)
Why is Hadoop so popular?
availability: open source and free
proven technology: nothing new & simple
works for all data and queries
branding: the big guys use it
it has the right abstractions
  – MR abstracts “counting” (= machine learning); see the word-count sketch below
it is an eco-system - it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on app / problem
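The "MR abstracts counting" bullet is exactly the canonical Hadoop word-count job: mappers emit (token, 1) pairs, reducers sum them. A sketch of the two classes (the Job driver is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // emit (word, total count)
        }
    }
}
```

Any aggregation that can be phrased as "group, then combine" fits this same mold, which is why so much machine-learning preprocessing maps onto it.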
Example: Amadeus Log Service
HDFS for compressed logs
HBase to index by timestamp and session id (see the sketch below)
SOLR for full-text search
Hadoop (MR) for usage stats & disasters
Oracle to store meta-data (e.g., user information)
Disclaimer: under construction & evaluation!
  – the current production system is proprietary
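A sketch of what "index by timestamp and session id" could look like with the standard HBase client API. The table name, column family, and composite row-key layout are illustrative assumptions, not the Amadeus schema (which, per the disclaimer, is still under construction).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogIndexSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("logs"))) {   // hypothetical table

            long timestamp = System.currentTimeMillis();
            String sessionId = "sess-4711";                              // hypothetical session id

            // Composite row key: fixed-width timestamp first, then session id.
            byte[] rowKey = Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(sessionId));

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("line"),
                          Bytes.toBytes("GET /booking ... 200"));        // the raw log line
            table.put(put);
        }
    }
}
```

Leading with the timestamp keeps time-range scans contiguous; leading with the session id would instead favor per-session lookups, and a real design would also have to guard against region hot-spotting.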
Some things Hadoop got wrong?
performance: huge start-up time & overheads
productivity: e.g., joins, configuration knobs (see the join sketch below)
SLAs: no response-time guarantees, no real time
Essentially ignored 40 years of DB research
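The productivity point becomes clear when you write a join by hand: an equi-join that is one line of SQL turns into a tagged map phase plus a buffering reduce phase. A minimal reduce-side join sketch over two hypothetical CSV inputs (user records tagged "U", order records tagged "O"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {
    // Hypothetical input lines: "U,<userId>,<name>" and "O,<userId>,<amount>"
    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            // Re-key by the join attribute and keep the source tag on the value.
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text userId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // Buffer both sides in memory, then emit the cross product;
            // this already breaks down if one side is large.
            List<String> users = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("U:")) users.add(s.substring(2));
                else orders.add(s.substring(2));
            }
            for (String name : users)
                for (String amount : orders)
                    ctx.write(userId, new Text(name + "\t" + amount));
        }
    }
}
```

This is the kind of boilerplate that higher layers such as Hive and Pig, and 40 years of database research before them, were designed to eliminate.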
Some things Hadoop got right
scales without (much) thinking
moves the computation to the data
fault tolerance, load balancing, …
How to improve on Hadoop
Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, …
Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
Option 3: Build new Hadoop components
  – real-time, etc.
Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a “super-Hadoop”
Conclusion
Focus on the “because we can…” part
  – help data scientists to make everything work
Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop “glue”; get the plumbing right
Package our results correctly
  – find the right abstractions (=> APIs of building blocks)