Big Data: Analytics Platforms. Donald Kossmann, Systems Group, ETH Zurich


1 Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch 1

2 Why Big Data? because bigger is smarter – answer tough questions because we can – push the limits and good things will happen 2

3 bigger = smarter? Yes! – tolerate errors – discover the long tail and corner cases – machine learning works much better 3

4 bigger = smarter? Yes! – tolerate errors – discover the long tail and corner cases – machine learning works much better But! – more data, more error (e.g., semantic heterogeneity) – with enough data you can prove anything – still need humans to ask right questions 4

5 Fundamental Problem of Big Data There is no ground truth – gets more complicated with self-fulfilling prophecies e.g., stock market predictions change behavior of people e.g., Web search engines determine behavior of people 5

6 Fundamental Problem of Big Data There is no ground truth – gets more complicated with self-fulfilling prophecies Hard to debug: takes human out of the loop – Example: How to play lottery in Napoli Step 1: You visit “oracles” who predict numbers to play Step 2: You visit “interpreters” who explain predictions Step 3: After you lose, “analysts” tell you that “oracles” and “interpreters” were right and that it was your fault. – [Luciano de Crescenzo: Thus Spake Bellavista] 6

7 Why Big Data? because bigger is smarter – answer tough questions because we can – push the limits and good things will happen 7

8 Because we can… Really? Yes! – all data is digitally born – storage capacity is increasing – counting is embarrassingly parallel 8

9 Because we can… Really? Yes! – all data is digitally born – storage capacity is increasing – counting is embarrassingly parallel But, – data grows faster than energy on chip – value / cost tradeoff unknown – ownership of data unclear (aggregate vs. individual) I believe that all these “but’s” can be addressed 9
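The claim that counting is embarrassingly parallel can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the toy event data, and the choice of a thread pool are all assumptions. Each partition is counted on its own, and the partial counts are merged by plain addition, so no coordination between partitions is needed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(records):
    """Count occurrences within one partition, independently of all others."""
    return Counter(records)


def parallel_count(partitions, max_workers=4):
    """Count each partition separately, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        # Merging is plain addition, so the merge order does not matter.
        total.update(partial)
    return total


# Hypothetical toy data: three partitions of event names.
partitions = [
    ["search", "book", "search"],
    ["book", "cancel"],
    ["search"],
]
totals = parallel_count(partitions)
```

Because the merge step is associative and commutative, the same structure carries over to partitions that live on different machines; only the transport of the partial counts changes.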

10 Utility & Cost Functions of Data [Figure; labels: Noise / Error, Utility, Cost] 10

11 Utility & Cost Functions of Data [Figure; labels: Noise / Error, Utility, Cost, curated, random, malicious] 11

12 Best Utility/Cost Tradeoff [Figure; labels: Noise / Error, Utility, Cost, malicious] 12

13 What is good enough? [Figure; labels: Noise / Error, Utility, Cost, curated] 13

14 What about platforms? Relational Databases – great for 20% of the data – not great for 80% of the data Hadoop – great for nothing – good enough for (almost) everything (if tweaked) 14

15 Why is Hadoop so popular? availability: open source and free; proven technology: nothing new & simple; works for all data and queries; branding: the big guys use it; it has the right abstractions – MR abstracts “counting” (= machine learning); it is an eco-system - it is NOT a platform – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, … – relational database systems – turned into a platform depending on app / problem 15
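The statement that MR abstracts “counting” can be made concrete with a single-process imitation of the map, shuffle, and reduce phases. This is a sketch under stated assumptions, not Hadoop's API: map_reduce, word_mapper, and sum_reducer are hypothetical names, and the input lines are made up. The point is only that the caller supplies two small functions and the framework owns the grouping.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Counting is just summing the emitted ones."""
    return sum(values)


# Hypothetical input: two short lines of text.
lines = ["big data is big", "data is data"]
word_counts = map_reduce(lines, word_mapper, sum_reducer)
```

Swapping in a different mapper and reducer changes what is counted without touching the grouping logic, which is the sense in which the abstraction is reusable.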

16 Example: Amadeus Log Service. HDFS for compressed logs; HBase to index by timestamp and session id; SOLR for full text search; Hadoop (MR) for usage stats & disasters; Oracle to store meta-data (e.g., user information). Disclaimer: under construction & evaluation!!! – current production system is proprietary 16
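To illustrate what indexing logs by timestamp and session id buys, here is a toy in-memory index kept in sorted key order. It is not HBase and does not reflect the Amadeus design: the LogIndex class, the (session_id, timestamp) key layout, and the sample log lines are assumptions made for the sketch. Keeping keys sorted turns "all entries of one session in a time window" into a contiguous range scan.

```python
import bisect


class LogIndex:
    """Toy sorted index over (session_id, timestamp) keys; not HBase itself."""

    def __init__(self):
        self._keys = []     # sorted list of (session_id, timestamp)
        self._entries = {}  # key -> log line

    def put(self, session_id, timestamp, line):
        key = (session_id, timestamp)
        if key not in self._entries:
            bisect.insort(self._keys, key)
        self._entries[key] = line

    def scan(self, session_id, start_ts, end_ts):
        """Return lines of one session with start_ts <= timestamp < end_ts."""
        lo = bisect.bisect_left(self._keys, (session_id, start_ts))
        hi = bisect.bisect_left(self._keys, (session_id, end_ts))
        return [self._entries[key] for key in self._keys[lo:hi]]


# Hypothetical log entries for two sessions.
index = LogIndex()
index.put("s1", 100, "login")
index.put("s2", 105, "search ZRH-JFK")
index.put("s1", 110, "search ZRH-LHR")
index.put("s1", 120, "book")
recent = index.scan("s1", 105, 121)
```

Putting the session id first in the key groups each session's entries together; putting the timestamp first would instead favor scans over a time window across all sessions. Which order fits depends on the queries the service has to answer.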

17 Some things Hadoop got wrong? performance: huge start-up time & overheads; productivity: e.g., joins, configuration knobs; SLAs: no response time guarantees, no real time. Essentially ignored 40 years of DB research 17

18 Some things Hadoop got right: scales without (much) thinking; moves the computation to the data; fault tolerance, load balance, … 18

19 How to improve on Hadoop Option 1: Push our knowledge into Hadoop? – implement joins, recursion, … Option 2: Push Hadoop into RDBMS? – build a Hadoop-enabled database system Option 3: Build new Hadoop components – real-time, etc. Option 4: Patterns to compose components – log service, machine learning, … – but, do not build a “super-Hadoop” 19

20 Conclusion Focus on “because we can…” part – help data scientists to make everything work Stick to our guns – develop clever algorithms & data structures – develop modeling tools and languages – develop abstractions for data, errors, failures, … – develop “glue”; get the plumbing right Package our results correctly – find the right abstractions (=> APIs of building blocks) 20

