Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch
Why Big Data?
because bigger is smarter
  – answer tough questions
because we can
  – push the limits and good things will happen
bigger = smarter?
Yes!
  – tolerate errors (see the sketch below)
  – discover the long tail and corner cases
  – machine learning works much better
But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions
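A minimal sketch, not from the slides, of the "tolerate errors" point: each individual observation is noisy, but an aggregate estimated over more data gets closer to the true value. The click-rate numbers are hypothetical and chosen purely for illustration.

```java
import java.util.Random;

public class BiggerIsSmarter {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double trueClickRate = 0.03;  // hypothetical quantity we want to estimate

        // Estimate the same rate from samples of growing size: each observation
        // is noisy (0 or 1), but the aggregate error shrinks roughly as 1/sqrt(n).
        for (int n : new int[]{100, 10_000, 1_000_000}) {
            int clicks = 0;
            for (int i = 0; i < n; i++) {
                if (rnd.nextDouble() < trueClickRate) clicks++;
            }
            double estimate = (double) clicks / n;
            System.out.printf("n=%,d  estimate=%.5f  |error|=%.5f%n",
                    n, estimate, Math.abs(estimate - trueClickRate));
        }
    }
}
```

The "But!" side of the slide is the flip side of the same effect: with enough data, even spurious patterns reach apparent statistical significance, which is why humans still have to ask the right questions.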
Fundamental Problem of Big Data
There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    e.g., stock market predictions change the behavior of people
    e.g., Web search engines determine the behavior of people
Hard to debug: takes the human out of the loop
  – Example: how to play the lottery in Napoli
    Step 1: You visit “oracles” who predict numbers to play.
    Step 2: You visit “interpreters” who explain the predictions.
    Step 3: After you lost, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
  – [Luciano De Crescenzo: Thus Spake Bellavista]
Why Big Data? (recap)
because bigger is smarter
  – answer tough questions
because we can
  – push the limits and good things will happen
Because we can… Really?
Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel (see the sketch below)
But,
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
I believe that all these “but’s” can be addressed
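To make "counting is embarrassingly parallel" concrete, here is a minimal sketch (plain Java parallel streams rather than Hadoop, and with made-up input): every partition of the data is counted independently, and the partial counts are merged by summing. That structure is exactly what MapReduce exploits at cluster scale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelCount {
    public static void main(String[] args) {
        // Hypothetical input: each string stands in for one partition of a large log.
        List<String> partitions = List.of(
                "error warn info info",
                "info error error",
                "warn info");

        // Each partition is tokenized and counted independently; the partial
        // per-partition counts are merged by summing, with no coordination
        // needed beyond that final merge.
        Map<String, Long> counts = partitions.parallelStream()
                .flatMap(p -> Arrays.stream(p.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts);  // e.g. {error=3, info=4, warn=2}
    }
}
```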
Utility & Cost Functions of Data
[Chart: utility and cost plotted against noise/error, with separate curves for curated, random, and malicious data]

Best Utility/Cost Tradeoff
[Chart: utility and cost against noise/error; “malicious” data is highlighted]

What is good enough?
[Chart: utility and cost against noise/error; “curated” data is highlighted]
What about platforms?
Relational Databases
  – great for 20% of the data
  – not great for 80% of the data
Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)
Why is Hadoop so popular?
availability: open source and free
proven technology: nothing new & simple
works for all data and queries
branding: the big guys use it
it has the right abstractions
  – MR abstracts “counting” (= machine learning); see the word-count sketch below
it is an eco-system - it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on app / problem
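The "MR abstracts counting" bullet is exactly the canonical Hadoop word-count job: mappers emit (token, 1) pairs, reducers sum them. A sketch of the two classes (the Job driver is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // emit (word, total count)
        }
    }
}
```

Any aggregation that can be phrased as "group, then combine" fits this same mold, which is why so much machine-learning preprocessing maps onto it.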
Example: Amadeus Log Service
HDFS for compressed logs
HBase to index by timestamp and session id (see the sketch below)
SOLR for full-text search
Hadoop (MR) for usage stats & disasters
Oracle to store meta-data (e.g., user information)
Disclaimer: under construction & evaluation!
  – the current production system is proprietary
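A sketch of what "index by timestamp and session id" could look like with the standard HBase client API. The table name, column family, and composite row-key layout are illustrative assumptions, not the Amadeus schema (which, per the disclaimer, is still under construction).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogIndexSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("logs"))) {   // hypothetical table

            long timestamp = System.currentTimeMillis();
            String sessionId = "sess-4711";                              // hypothetical session id

            // Composite row key: fixed-width timestamp first, then session id.
            byte[] rowKey = Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(sessionId));

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("line"),
                          Bytes.toBytes("GET /booking ... 200"));        // the raw log line
            table.put(put);
        }
    }
}
```

Leading with the timestamp keeps time-range scans contiguous; leading with the session id would instead favor per-session lookups, and a real design would also have to guard against region hot-spotting.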
Some things Hadoop got wrong?
performance: huge start-up time & overheads
productivity: e.g., joins, configuration knobs (see the join sketch below)
SLAs: no response-time guarantees, no real time
Essentially ignored 40 years of DB research
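The productivity point becomes clear when you write a join by hand: an equi-join that is one line of SQL turns into a tagged map phase plus a buffering reduce phase. A minimal reduce-side join sketch over two hypothetical CSV inputs (user records tagged "U", order records tagged "O"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {
    // Hypothetical input lines: "U,<userId>,<name>" and "O,<userId>,<amount>"
    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            // Re-key by the join attribute and keep the source tag on the value.
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text userId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // Buffer both sides in memory, then emit the cross product;
            // this already breaks down if one side is large.
            List<String> users = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("U:")) users.add(s.substring(2));
                else orders.add(s.substring(2));
            }
            for (String name : users)
                for (String amount : orders)
                    ctx.write(userId, new Text(name + "\t" + amount));
        }
    }
}
```

This is the kind of boilerplate that higher layers such as Hive and Pig, and 40 years of database research before them, were designed to eliminate.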
Some things Hadoop got right
scales without (much) thinking
moves the computation to the data
fault tolerance, load balancing, …
How to improve on Hadoop
Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, …
Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
Option 3: Build new Hadoop components
  – real-time, etc.
Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a “super-Hadoop”
Conclusion
Focus on the “because we can…” part
  – help data scientists to make everything work
Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop “glue”; get the plumbing right
Package our results correctly
  – find the right abstractions (=> APIs of building blocks)