András Benczúr Head, “Big Data – Momentum” Research Group Big Data Analytics Institute for Computer.

Presentation transcript:

Big Data Analytics
András Benczúr, Head, “Big Data – Momentum” Research Group
Institute for Computer Science and Control, Hungarian Academy of Sciences
in collaboration with Volker Markl, TU Berlin

Data Management vs. Data Analytics
o Data Management: traditional gathering, access, and search
o Analytics (Machine Learning): insights and predictions

Deep Analytics needs

Twitter Example: Meryl Streep – Oscar, 2012


What Was the Analytics Challenge?
One year of tweets: a collection of 1 billion tweets, 100 GB
Ad hoc queries (e.g. "Meryl Streep") may have 100,000+ hits
Fast response is needed to support the analyst
Solutions
o In-memory databases (SAP HANA, …): cost and physical limitations
o Customized approximate data structures (Bloom filters, MinHash fingerprints)
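As an illustration of the second solution, here is a minimal Bloom filter sketch in Python. It is not the data structure used in the project described on the slide; the class name, the bit-array size, and the double-hashing scheme built on SHA-256 are all assumptions chosen for brevity. It shows the core trade-off: membership tests in a small, fixed amount of memory, with no false negatives and a tunable rate of false positives.

```python
import hashlib


class BloomFilter:
    """Approximate set membership in fixed memory (hypothetical sketch)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing). This choice is an assumption; any
        # family of well-mixed hash functions would serve.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Index the words of one (made-up) tweet, then test membership.
tweet_terms = BloomFilter()
for word in "meryl streep wins the oscar".split():
    tweet_terms.add(word)
print("streep" in tweet_terms)
```

A filter like this could be kept per time window or per user, so that an ad hoc query first checks cheap filters and touches the full tweet store only for windows that may contain a hit.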

Need for Networked Analytics

Information in interconnectivity
Number and influence, impressibility of followers, tweets
Statistical properties of temporal dynamics and of the number of users reached by messages

Predictive Claims Processing
Hungarian insurance company cases
Rule generation (incl. social media)
Feature engineering
Machine learning & alert generation
[Figure labels: Days since contract; Known fraud; Normal sample]

A distance of 3-4 transactions raises the flag
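One way to read this rule is as a hop distance in a transaction graph: any party within a few transactions of a known fraud case gets flagged. The slide does not spell out the method, so the sketch below is only an assumption of how such a check could look. The function name `flag_near_fraud`, the undirected edge list, and the `max_hops` threshold are all hypothetical.

```python
from collections import deque


def flag_near_fraud(edges, known_fraud, max_hops=3):
    """Return {party: hops} for parties within max_hops of known fraud.

    edges: iterable of (party_a, party_b) transaction pairs (assumed
    undirected). known_fraud: set of parties already confirmed as fraud.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Multi-source breadth-first search starting from all fraud cases.
    dist = {party: 0 for party in known_fraud}
    queue = deque(known_fraud)
    while queue:
        party = queue.popleft()
        if dist[party] == max_hops:
            continue  # do not expand beyond the threshold
        for neighbour in graph.get(party, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[party] + 1
                queue.append(neighbour)

    # Report only the newly flagged parties, not the fraud cases themselves.
    return {party: hops for party, hops in dist.items() if hops > 0}


chain = [("f", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(flag_near_fraud(chain, {"f"}, max_hops=3))
```

In practice the hop count would more likely be one feature among many fed to the machine-learning step, rather than a hard rule on its own.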

Need for Real Time Analytics

Software AND Human Latencies

Deep Analysis of Big Data is Key to Competitiveness

Data Science: Deep Analytics + Big Data

Data Scientist magic triangle
[Figure: a triangle with "Data Science" at the center and three corners]
o Application: Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
o Scalable Data Management (DM)
o Machine Learning, Statistics, Data Analysis (ML)
Topics shown in the figure: Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Decoupling, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Stochastic Gradient Descent, Regression, Statistics, Hashing, Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Data Flow, Hardware Adaptation, Indexing, Resource Management, NF²/XQuery, Data Warehouse/OLAP, Real-Time

Apache Flink: the emerging European tool
TU Berlin / DFKI (DE)
SICS (SE)
SZTAKI (HU)

Data Scientist Supply Chain

Registered teams - affiliation

Extra Slide: Fully Distributed Modeling
Needs no central service; suitable for:
o Ad hoc networks
o Privacy requirements
Model delta updates are sent to peers
Results for applicability in:
o Classification
o Recommender Systems
Peer 1: $R \approx P_1 Q_1$
Peer 2: $R \approx P_2 (Q_2 + \Delta Q)$
[Figure labels: Measurement; $\Delta Q$ passed between peers]
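The recommender case on this slide can be sketched as matrix factorization in which each peer keeps its user factors private and shares only the change to the item factors. The simulation below is a minimal sketch under stated assumptions, not the authors' algorithm: two peers in one process, plain SGD with a made-up learning rate and regularization, toy ratings, and hypothetical helper names (`local_pass`, `apply_delta`, `rmse`).

```python
import random


def local_pass(P, Q, ratings, lr=0.05, reg=0.01):
    """Run SGD over one peer's private ratings.

    P (user factors) is updated in place and never leaves the peer.
    Q (item factors) is left untouched; the change this peer would make
    to it is returned as delta_Q, the only thing sent to other peers.
    """
    Q_new = {item: list(vec) for item, vec in Q.items()}
    for user, item, rating in ratings:
        p, q = P[user], Q_new[item]
        err = rating - sum(pk * qk for pk, qk in zip(p, q))
        for k in range(len(p)):
            pk, qk = p[k], q[k]
            p[k] += lr * (err * qk - reg * pk)
            q[k] += lr * (err * pk - reg * qk)
    return {item: [n - o for n, o in zip(Q_new[item], Q[item])] for item in Q}


def apply_delta(Q, delta_Q):
    """Merge a received delta into a local replica: Q <- Q + delta_Q."""
    for item, delta in delta_Q.items():
        Q[item] = [qk + dk for qk, dk in zip(Q[item], delta)]


def rmse(P, Q, ratings):
    sq = [(r - sum(pk * qk for pk, qk in zip(P[u], Q[i]))) ** 2
          for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5


rng = random.Random(0)


def init(keys, dim=2):
    return {k: [rng.uniform(0.1, 0.5) for _ in range(dim)] for k in keys}


# Toy data: each peer holds ratings of its own users only.
ratings_1 = [("a", "x", 5.0), ("a", "y", 3.0), ("b", "x", 4.0), ("b", "y", 2.0)]
ratings_2 = [("c", "x", 1.0), ("c", "y", 2.0), ("d", "x", 2.0), ("d", "y", 4.0)]
P1, P2 = init("ab"), init("cd")
Q1 = init("xy")
Q2 = {item: list(vec) for item, vec in Q1.items()}  # peer 2's replica

before = rmse(P1, Q1, ratings_1) + rmse(P2, Q2, ratings_2)
for _ in range(200):
    delta = local_pass(P1, Q1, ratings_1)   # peer 1 learns locally
    apply_delta(Q1, delta)
    apply_delta(Q2, delta)                  # "send" delta_Q to peer 2
    delta = local_pass(P2, Q2, ratings_2)   # peer 2 learns locally
    apply_delta(Q2, delta)
    apply_delta(Q1, delta)                  # "send" delta_Q to peer 1
after = rmse(P1, Q1, ratings_1) + rmse(P2, Q2, ratings_2)
print(before, after)
```

Raw ratings and user factors stay on the peer that owns them, which is what makes the scheme attractive under privacy requirements. A real deployment would also have to handle lost, delayed, or reordered deltas, which this lock-step simulation ignores.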

Conclusions
Software Latency
o For data streaming solutions, we have to combine
   o Batch pre-computed models updated in real time (lambda architecture)
   o Very low-memory data approximation
   o Carefully selected database operations to optimize communication
o Machine learning, prediction, classification made
   o Highly time-sensitive, streaming
   o Fully distributed: each element learns by passing model error to peers
Human Latency
o Shortage of data scientists worldwide
o Needs training AND systems with a reduced learning curve
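The lambda architecture mentioned above can be illustrated with a toy in-process sketch: a batch view recomputed periodically from the full log, a speed view updated per event, and a query that merges the two. Simple counting stands in for a real model here, and the class and method names (`LambdaCounter`, `ingest`, `run_batch`, `query`) are hypothetical, not taken from the talk or from any framework.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: batch view + speed view, merged on query."""

    def __init__(self):
        self.master_log = []         # append-only record of all events
        self.batch_view = Counter()  # recomputed from scratch by the batch job
        self.speed_view = Counter()  # incremental, covers events since last batch

    def ingest(self, key):
        """Real-time path: log the event and update the speed view."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch path: rebuild the view from the full log, reset the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Serving path: combine the pre-computed and the real-time results."""
        return self.batch_view[key] + self.speed_view[key]


counts = LambdaCounter()
for hashtag in ["oscar", "oscar", "streep"]:
    counts.ingest(hashtag)
counts.run_batch()        # e.g. a nightly job
counts.ingest("oscar")    # arrives after the batch run
print(counts.query("oscar"))
```

The same shape applies when the batch job trains a model and the speed layer applies incremental updates to it; only the merge step becomes more involved than an addition.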