From RDBMS to Hadoop: A case study. Mihaly Berekmeri, School of Computer Science, University of Manchester. Data Science Club, 14th July 2016.

Presentation transcript:

From RDBMS to Hadoop: A case study
Mihaly Berekmeri
School of Computer Science, University of Manchester
Data Science Club, 14th July 2016
Hayden Clark, Robert Wakeling, Xiao-Jun Zeng, John Keane
Project supported by EPSRC IAA

Introduction
The context:
– Wadaro Limited: leading provider of quality of experience (QoE) monitoring and performance analysis solutions for mobile networks.
The challenge:
– provide real-time QoE monitoring of many millions of mobile network customers.
The aim:
– develop big data analytics based on the existing bottom-up dynamic hierarchical structure
– real-time performance analysis and prediction for any size of mobile network.

Use case: Wadaro
Wadaro Limited
– SIM-based QoE (SIM applet)
– gathers Key Performance Indicators (KPIs) from subscribers
– measures service performance from the user perspective
Usage scenarios
– network coverage benchmark
– importance of cells based on coverage
– benchmark customer experience of devices

Current Setup & Challenges
– Current setup
  Linux LAMP (MySQL)
  In-house and hosted
– Challenges:
  Growing database
  Sluggish query performance
  Overnight aggregations
  Added complexity
  Predicted exponential increase in incoming data
  Little capacity to run data analytics whilst maintaining operational support
New approach needed for data ingestion and analytics!

New architecture expectations
– Scalable data ingestion:
  Capable of handling millions of devices
– Scalable, cheap storage
  Terabytes of data
– Scalable, fast analytics engine
  SQL compatibility
  Avoid overnight aggregations

Scalable data ingest: Flume
Ingesting data into Hadoop
Advantages
– distributed service for getting data into the cluster
– captures and processes data asynchronously
– acts as a mediator between data producers and the data store
Cons
– the complexity of writing custom agents
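
The mediator role described on this slide can be illustrated with a small sketch. This is not Flume's API and not the project's agent code; it is a plain Python stand-in, assuming a single producer, a bounded in-memory buffer, and a list acting as the data store. The names `run_pipeline`, `channel`, `store` and the sample KPI fields are hypothetical.

```python
import queue
import threading


def run_pipeline(events, batch_size=3):
    """Move events from a producer to a store through a buffered channel."""
    channel = queue.Queue(maxsize=100)  # buffer that decouples producer and store
    store = []                          # stand-in for the data store
    stop = object()                     # sentinel marking the end of the stream

    def source():
        # Producer side: hand events to the channel and move on.
        for event in events:
            channel.put(event)          # blocks only when the channel is full
        channel.put(stop)

    def sink():
        # Consumer side: drain the channel and write in batches.
        batch = []
        while True:
            event = channel.get()
            if event is stop:
                break
            batch.append(event)
            if len(batch) == batch_size:
                store.append(list(batch))
                batch.clear()
        if batch:
            store.append(list(batch))   # flush the final partial batch

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store


if __name__ == "__main__":
    # Hypothetical KPI events; the field names are illustrative only.
    kpi_events = [{"device": i, "signal_dbm": -70 - i} for i in range(7)]
    for written in run_pipeline(kpi_events):
        print(len(written), "events written")
```

The producer never waits for the store to finish a write; it only waits if the buffer fills up, which is the asynchronous decoupling the slide refers to.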

Scalable data analytics: IBM BigSQL
Massively parallel processing engine
Advantages
– fully ANSI SQL compliant (DB2)
– audited TPC benchmarks show Big SQL being faster than competitors Hive and Impala
– enterprise ready
– compatible with BI tools such as Tableau
Cons
– specific to IBM

Scalable data analytics: IBM BigSQL
Technical issues encountered
– issues with MySQL compatibility
  DB2-specific commands
– lacking documentation
  few examples, low community support
– no update/delete support
– complex architecture
  new skills, expertise required

Scalable data analytics: IBM BigSQL
Key development process insights:
Query type, database analysis:
– data partitioning granularity (for example by day, month or year) improves query response times and efficiency
– storage format (Parquet, Avro, etc.)
– SQL or NoSQL approach?
  Table scans, aggregations -> columnar, Hive
  Frequent, small data lookups -> columnar, HBase
  What if you need to do both?
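
The partitioning point can be sketched as follows. This is a minimal illustration, assuming Hive-style `key=value` partition directories and a daily granularity; the table name `kpi` and both helper functions are hypothetical and are not taken from the project.

```python
from datetime import date, timedelta


def partition_path(table, day):
    """Hive-style directory for the partition holding one day's rows."""
    return f"{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def partitions_to_scan(table, start, end):
    """Partitions a query over [start, end] has to read; the rest are skipped."""
    n_days = (end - start).days + 1
    return [partition_path(table, start + timedelta(days=i)) for i in range(n_days)]


if __name__ == "__main__":
    # One week of data touches 7 daily partitions, however large the table is.
    for path in partitions_to_scan("kpi", date(2016, 7, 1), date(2016, 7, 7)):
        print(path)
```

With daily partitions a query filtered to one week reads seven directories. Coarser granularity (by month or year) means fewer, larger partitions and more data read per query, so the choice depends on the typical query range.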

Results
– Experimental setup:
  70 GB, 152 million rows
  1 hosted MySQL on top-end server
  3 in-house cheap desktops
– Key results
  Best case: 45-minute queries -> 10 seconds
  Worst case: 3-hour queries -> 5 minutes
  2-3 orders of magnitude improvement
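
The arithmetic behind the two quoted cases can be worked through as below. It is a sketch that assumes each pair of figures is a before/after timing for the same query; the function name `speedup` is illustrative.

```python
import math


def speedup(before_s, after_s):
    """Return the speedup factor and its size in orders of magnitude."""
    factor = before_s / after_s
    return factor, math.log10(factor)


best = speedup(45 * 60, 10)         # 45 minutes down to 10 seconds
worst = speedup(3 * 3600, 5 * 60)   # 3 hours down to 5 minutes

if __name__ == "__main__":
    for label, (factor, orders) in (("best", best), ("worst", worst)):
        print(f"{label}: {factor:.0f}x faster, about {orders:.1f} orders of magnitude")
```

Under this reading the best case is a 270x speedup (about 2.4 orders of magnitude) and the worst case is 36x (about 1.6).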

Conclusions
Is the switch from RDBMS to Hadoop worth it?
– Significant cost reduction
– Scalable data ingest
– Scalable, cheap storage
– Scalable data analytics
– Opens up data for future analysis
Commercially, it offers the possibility of expansion into massive markets.
– prepared for business evolution

Future work
Stream analysis
– Model the time-varying and interacting relationships between raw streaming data and various network performance indicators
– Online learning algorithms
– Feed the knowledge back into the network
– ML: classification, regression, clustering

Thank you for your attention! Questions?