Big Data Tools – Hadoop. S.S. Mulay, Sr. V.P. Engineering. February 1, 2013

Hadoop – A Prelude

Apache Projects and Animal-Friendly Names
Some of the projects under the Apache Foundation, to mention a few: Apache ZooKeeper, Apache Tomcat, Apache Pig.

And now, Hadoop

Hadoop – The Name

Hadoop – The Relevance
Two important things to know when discussing Big Data: ● MapReduce ● Hadoop.

Hadoop – How Was It Born?
● To process huge volumes of data (Big Data), as the amount of generated data continued to increase rapidly.
● The Web was also generating more and more information, and indexing that content was becoming quite challenging.

Hadoop – The Reality vs. the Myth
Hadoop is not a direct replacement for enterprise data warehouses, data marts and other data stores that are commonly used to manage structured or transactional data. It is used to augment enterprise data architectures by providing an efficient and cost-effective means of storing, processing, managing and analyzing ever-increasing volumes of semi-structured or unstructured data. Hadoop is useful across virtually every vertical industry.

Hadoop – Some Use Cases
● Digital marketing automation
● Log analysis and event correlation
● Fraud detection and prevention
● Predictive modeling for new drugs
● Social network and relationship analysis
● ETL (Extract, Transform, Load) on unstructured data
● Image correlation and analysis
● Collaborative filtering

Hadoop – What Do We Expect from It?
If we analyze the use cases above, we realize that:
● The data comes in varied formats and from varied sources.
● Incoming streams of data must be handled, and sometimes processed, in real time.
● A connector to the existing RDBMS is needed.
● A distributed file system is needed.
● Data warehousing capability is needed on top of the processed data.
● A "map only" capability is needed to perform image matching and correlation.
● A scalable database is needed.
● There is a growing need for a GUI to operate Hadoop and develop applications for it.
● A framework for parallel computation is needed.
● A distributed computing environment is needed.
● Machine learning and data mining capabilities are needed.
● Almost all of these workloads need a way to manage data-processing jobs.

Hadoop – Components Which Come to the Rescue
● HDFS – distributed file system
● MapReduce – distributed processing of large data sets
● ZooKeeper – coordination service for distributed applications
● HBase – scalable distributed database; supports structured data
● Avro – data serialization system
● Sqoop – connector to structured databases
● Chukwa – monitoring of large distributed systems
● Flume – moves large volumes of data efficiently after processing
● Hue – GUI to operate and develop Hadoop applications
● Hive – data warehousing framework
● Pig – framework for parallel computation
● Oozie – workflow service to manage data-processing jobs
● Mahout – machine learning and data mining library
● Many more …
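The components above are services and libraries layered on top of HDFS. As a concrete illustration of the lowest layer, here is a minimal sketch (not from the original deck) of writing and then reading a file through the Hadoop Java FileSystem API; the NameNode address and file path are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/sample.txt");          // hypothetical path

        // Write a small file; HDFS splits large files into blocks and
        // replicates them across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```

Higher-level components such as Hive, Pig and Sqoop ultimately read and write data through this same file-system layer.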

Hadoop – Who's Using It?
(The adopters appeared as company logos on the original slide.)
● Uses Hadoop and HBase for: social services, structured data storage, processing for internal use
● Uses Hadoop for: Amazon's product search indices; they process millions of sessions daily for analytics
● Uses Hadoop for: search optimization and research
● Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for The Cancer Genome Atlas (TCGA) project and other groups
● Uses Hadoop for: internal log reporting / parsing systems designed to scale to infinity and beyond; a web-wide analytics platform
● Uses Hadoop as a source for reporting / analytics and machine learning
● And many more …

Hadoop – The Various Forms Today
● Apache Hadoop – native Hadoop distribution from the Apache Foundation
● Yahoo! Hadoop – Hadoop distribution from Yahoo
● CDH – Hadoop distribution from Cloudera
● Greenplum Hadoop – Hadoop distribution from EMC
● HDP – Hadoop platform from Hortonworks
● M3 / M5 / M7 – Hadoop distributions from MapR
● Project Serengeti – VMware's implementation of Hadoop on vCenter
● And more …

Hadoop – Use Case Example – Log Processing
Some practical log-processing use cases in general use today. Assume we have huge logs, generated over a period of time and ranging into terabytes, and we want to know:
● Analytics – application / web site performance
● Reporting – page views, user sessions
● Event detection and correlation
● Page views / user sessions, weekly / monthly
● Users and their behavioral patterns
● IPs under investigation and their behavioral patterns

Hadoop – Use Case Example – Log Processing
In the conventional method, parallelism is on a per-file basis, not within a single file:
● Log file 1 → Task 1: grep [pattern] | awk
● Log file 2 → Task 2: grep [pattern] | awk
● Log file n → Task n: grep [pattern] | awk
● A new task then concatenates the per-file data sets into the final data set.

Hadoop – Use Case Example – Log Processing
With MapReduce, even a single log file is split into chunks that are processed in parallel:
● Log file 1, chunk 1 → Task 1: grep [pattern] | awk
● Log file 1, chunk 2 → Task 2: grep [pattern] | awk
● Log file 1, chunk n → Task n: grep [pattern] | awk
● The per-chunk outputs are combined into the resultant data set.
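To make this flow concrete, here is a minimal sketch (not part of the original deck) of a map-only "distributed grep" job written against the Hadoop Java MapReduce API; the pattern, input path and output path are hypothetical and passed in as arguments. With zero reduce tasks, each mapper writes its matching lines directly to the output, mirroring the per-chunk tasks in the diagram above.

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogGrep {

    public static class GrepMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // The pattern is passed in through the job configuration.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the whole line whenever the pattern matches.
            if (pattern.matcher(value.toString()).find()) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.pattern", args[2]);            // e.g. "ERROR" (hypothetical)

        Job job = Job.getInstance(conf, "log grep");
        job.setJarByClass(LogGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);                     // map-only job, as in the diagram above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/raw
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/matches
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A run might look like `hadoop jar loggrep.jar LogGrep /logs/raw /logs/matches ERROR`; the jar name, paths and pattern are illustrative only.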

Hadoop – Use Case Example – Log Processing
Infrastructure realities in the conventional method:
● One server with a 1 Gbps NIC can copy a 100 GB file in about 14 minutes.
● One server with a single disk can typically copy a 100 GB file in about 20 to 25 minutes.
How things change with MapReduce:
● The network bottleneck is eliminated, because multiple servers with 1 Gbps NICs each read a smaller chunk of the same 100 GB of data.
● The disk bottleneck is eliminated, because each individual server has multiple disks, with underlying RAID to improve disk performance.
Assuming a single disk can transfer data at 75 MB/s, and a Hadoop cluster of 4,000 nodes with 6 disks per server, the overall throughput of the setup would be 6 × 75 MB/s × 4,000 ≈ 1.8 TB/s, so reading 1 PB of data would take roughly 10 minutes.
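As a sanity check on that arithmetic, the small Java snippet below (an illustration, not part of the deck) recomputes the aggregate throughput and the time to read 1 PB using the figures assumed on the slide.

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        double diskMBps = 75.0;      // single-disk transfer rate in MB/s (from the slide)
        int disksPerNode = 6;        // disks per server (from the slide)
        int nodes = 4000;            // cluster size (from the slide)

        double totalMBps = diskMBps * disksPerNode * nodes;   // aggregate throughput in MB/s
        double totalTBps = totalMBps / 1_000_000.0;           // convert MB/s to TB/s (decimal units)
        double petabyteMB = 1_000_000_000.0;                  // 1 PB expressed in MB
        double seconds = petabyteMB / totalMBps;              // time to scan 1 PB

        System.out.printf("Aggregate throughput: %.1f TB/s%n", totalTBps);          // ~1.8 TB/s
        System.out.printf("Time to read 1 PB: %.0f s (~%.0f minutes)%n",
                seconds, seconds / 60.0);                                           // ~556 s, ~9-10 minutes
    }
}
```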

Hadoop – Big Data Integration Challenges
Technology / tools. A successful big data initiative requires acquiring, integrating and managing several big data technologies such as Hadoop, MapReduce, NoSQL databases, Pig, Sqoop, Hive, Oozie and others. Conventional data management tools fail when trying to integrate, search and analyze big data sets, which range from terabytes to multiple petabytes of information.
People. As with any new technology, staff need to be trained in big data technologies to learn the proper skills and best practices. The two biggest challenges are finding in-house expertise and allocating sufficient budget, time and resources.
Processes. Because this is a niche area, few documented procedures and processes are available, and requirements change depending on the application use case.

Hadoop – Native Solutions & Challenges
● Inherent knowledge of the various components and their dependencies is required.
● Configuration and implementation need specific skills, not only to implement but also to manage.
● Data scientists depend on the back-end programming team.
● Any version upgrade needs to be tested thoroughly before the current setup is upgraded.
● Support is community-based only, which can lead to issues for an enterprise implementing Hadoop.
● Any integration, and the problems arising out of it, can become a show-stopper.

Hadoop – Advantages of Commercial Solutions
● Comes fully integrated as a package, and documented.
● Implementation is a straightforward activity.
● Comes with a configuration manager that helps set up the infrastructure quickly.
● Connects easily to enterprise applications and architecture.
● Some come with GUI capabilities that eliminate most of the programming requirements, putting control entirely in the hands of the data scientists.
● Comes with many add-on capabilities, including a GUI for management.
● Most of these commercial editions work closely with the Apache Foundation and hence are compatible.
● It is pre-tested, so package dependencies, version changes and so on are assured with the distribution.

Hadoop – Commercial Solutions for Hadoop
The solutions fit into two categories:
● Infrastructure automation – Cloudera, Hortonworks
● Application automation – Karmasphere Studio, Talend, Pentaho
These are just some of them.

Gartner Report – Magic Quadrant for Data Integration Tools

Hadoop & Cloud – Hand in Hand?
What advantages does the cloud bring?
● Reduced physical infrastructure
● Quick deployment using cloud cloning / templates
● Elasticity: auto-scaling capabilities of the cloud to spawn / de-spawn instances as and when required
Thus Hadoop on the cloud does bring the above advantages to the table for enterprises.
● All the commercial distributions available today offer a virtual image option to deploy on a cloud / virtualization platform.
● Virtualization solution providers like VMware have come up with Project "Serengeti" to support quick deployment and management of Hadoop on the cloud.
● Cloud service providers like Amazon, Netmagic and others offer a deployment option for Hadoop infrastructure on the cloud.

Contact Details – For related queries / feedback, mail to

Thank You

companies/netmagic NetmagicSolutions /user/netmagicsolutions