Applying Data Warehousing and Big Data Techniques to Analyze Internet Performance Thiago Barbosa, Renan Souza, Sérgio Serra, Maria Luiza and Roger Cottrell.

Applying Data Warehousing and Big Data Techniques to Analyze Internet Performance
Thiago Barbosa, Renan Souza, Sérgio Serra, Maria Luiza and Roger Cottrell
SLAC National Accelerator Laboratory, Federal University of Rio de Janeiro, and Federal Rural University of Rio de Janeiro
Supporting agencies:

Agenda Introduction Scenario Proposed Approach Experimental Evaluation Conclusion

Introduction

Motivation
Measuring the quality of the Internet is:
- Essential to evaluate the performance of data links around the world
- Important to keep track of how countries have improved their connections over the years
- Needed to answer questions like "What is the overall Internet connection speed in the world?"
But this is not a trivial task, because a large amount of data must be collected, cleaned and analyzed.

PingER Project
PingER:
- Gathers network data using the ping facility and stores it in a centralized repository for future analyses
- Monitors the end-to-end performance of Internet links worldwide
- Preserves a vast repository of network performance measurements from and to sites all around the world
Every 30 minutes, pings with two different packet sizes are sent from over 80 monitoring agents (MAs, the source nodes) to over 700 locations (destination nodes) spread over 160 countries.
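As a concrete illustration of where PingER's raw numbers come from, the sketch below parses the summary line of standard Unix `ping` output. This is a minimal, hypothetical parser: the field names and the assumption that metrics are derived from this exact line format are illustrative, not the project's actual extraction code.

```python
import re

# Matches the summary line of Unix `ping` output, e.g.:
#   "rtt min/avg/max/mdev = 10.1/12.3/15.7/1.2 ms"
RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)

def parse_rtt_summary(line):
    """Return min/avg/max/mdev RTT (in ms) from a ping summary line, or None."""
    m = RTT_RE.search(line)
    if not m:
        return None
    rtt_min, rtt_avg, rtt_max, rtt_mdev = (float(x) for x in m.groups())
    return {"min": rtt_min, "avg": rtt_avg, "max": rtt_max, "mdev": rtt_mdev}
```

A measurement harness would run such a parser on each ping batch and append the extracted values to the flat files described next.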

Scenario / Problem
- Data from the MAs (sources) are analyzed to extract 16 different metrics
- A PingER measurement is defined by a combination of several network parameters
- Thousands of small flat files, totaling more than 60 GB of data
- There is no data management system: the data are hard to access and analyze, and there is no way to run structured queries

Proposed Approach

Data Selection
- Daily files: the finest granularity to process
- From 1998 to 2014
- All 16 metrics
- More than 60 GB of data

Data Modeling

Data Transformation and Loading into the Data Warehouse
A MapReduce dataflow executed in SciCumulus, a workflow management system.
- Mapper: reads the PingER flat files and transforms them following the dimensional data model
- Reducer: combines all files of a given year
The distributed process is transparent: we can monitor the status of the execution and how each specific data transformation occurs.
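The mapper/reducer pair above can be sketched in Hadoop Streaming style (pure functions over tab-separated records). The input record layout (`date src dst metric value`) and the grouping key are assumptions for illustration; the real PingER flat-file format and the SciCumulus activity definitions are not reproduced here.

```python
def map_line(line):
    """Transform one flat-file line into (year, dimensional record).

    Assumed input layout: "<YYYY-MM-DD> <src> <dst> <metric> <value>".
    The year is the grouping key, so the reducer can combine all
    records of a given year into one output file.
    """
    date, src, dst, metric, value = line.split()
    year = date[:4]
    # Record fields follow a dimensional layout: time, source node,
    # destination node, metric name, measure value.
    return year, "\t".join([date, src, dst, metric, value])

def reduce_year(year, records):
    """Combine all transformed records of one year into a single file body."""
    return year, "\n".join(records)
```

In a real Hadoop Streaming job these functions would be driven by stdin/stdout; here they are kept as plain functions for clarity.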

Solution

Experimental Evaluation

Running the Experiments
Cluster consisting of:
- 4 virtual machines, 4 cores each
- 16 GB of RAM each
- 220 GB of hard drive
- Gigabit/sec LAN
- Linux Red Hat 6.6
- Cloudera Distribution 5.4.4
We first ran the data transformation process, then loaded the resulting files into Impala, and finally executed multiple analytical queries over the data.

Some Numbers…
- Over 100,000 files transformed into 17 files
- 45 GB of transformed files
- Transformation process: 6 h 12 min
- 32 seconds on average to upload each file to HDFS
- Impala tables were created using the Parquet format; Parquet is a column-oriented binary file format designed to be highly efficient for large-scale queries [7]
- 0.28 seconds on average to create the tables
- 6.75 seconds on average to insert the data
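To make the Parquet-backed table setup concrete, the helper below builds an Impala-style `CREATE TABLE ... STORED AS PARQUET` statement. The table name and columns are illustrative assumptions, not the paper's actual schema; the DDL is only constructed as a string here, not executed against a cluster.

```python
def parquet_table_ddl(table, columns):
    """Build an Impala CREATE TABLE statement backed by Parquet files.

    `columns` is a list of (name, impala_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n)\nSTORED AS PARQUET;"

# Hypothetical measurement table, one row per (time, src, dst, metric).
ddl = parquet_table_ddl(
    "pinger_measurements",
    [("measure_date", "TIMESTAMP"),
     ("src_node", "STRING"),
     ("dst_node", "STRING"),
     ("metric", "STRING"),
     ("value", "DOUBLE")],
)
```

Storing the fact table as Parquet is what makes the analytical scans cheap: only the columns a query touches are read from disk.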

An Example
Average RTT, throughput and packet loss: Pakistan vs. world nodes, from 8/15/2007 to 8/31/2007.
19.08 seconds to run.
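The shape of that query can be sketched in pure Python over a few sample rows: filter by source country and date range, then average the RTT. The row layout, field names and sample values are illustrative assumptions; the real query runs as SQL in Impala over the full warehouse.

```python
from datetime import date

def avg_rtt(rows, country, start, end):
    """Average RTT over rows whose source country matches and whose
    date falls within [start, end] (inclusive)."""
    vals = [r["rtt"] for r in rows
            if r["src_country"] == country and start <= r["date"] <= end]
    return sum(vals) / len(vals) if vals else None

# Hypothetical sample rows; the 9/1 row falls outside the range below.
rows = [
    {"date": date(2007, 8, 16), "src_country": "PK", "rtt": 300.0},
    {"date": date(2007, 8, 20), "src_country": "PK", "rtt": 500.0},
    {"date": date(2007, 9, 1),  "src_country": "PK", "rtt": 900.0},
    {"date": date(2007, 8, 18), "src_country": "BR", "rtt": 120.0},
]
result = avg_rtt(rows, "PK", date(2007, 8, 15), date(2007, 8, 31))  # 400.0
```

The same filter-and-aggregate pattern applies to the throughput and packet-loss metrics.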

Conclusion

Conclusion
This work presented:
- A large-scale solution to deal with a huge amount of data in a data warehouse analytical environment
- A data warehouse environment that proved efficient and capable of handling all the massive data from the PingER project
- A facility to search and filter the PingER database in ways the original PingER project was not capable of

Future Work
- Publish the query results using Linked Open Data standards, following the PingER LOD Ontology by Souza et al.
- Make the big data platform accessible to the general public
- Automate the ETL process and keep the data up-to-date

Any Questions?
Feel free to contact: thiago.marcos.13@gmail.com cottrell@slac.stanford.edu