Network Traffic Analysis using HADOOP Architecture Shan Zeng HEPiX, Beijing 17 Oct 2012.



ZENG SHAN/CC/IHEP
Outline
- Introduction to Hadoop
- Traffic Information Capture
- Traffic Information Resolution
- Traffic Information Storage
- Traffic Information Analysis
- Traffic Information Display
- Conclusion

Introduction to Hadoop

What can Hadoop do?
- Hadoop is an open-source software framework, licensed under the Apache v2 license.
- It was originally developed to support distribution for the Nutch search engine project.
- It supports data-intensive distributed applications.
- It enables applications to work with thousands of computationally independent computers and petabytes of data.

Traffic Information Capture

What is a flow?
- A network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain.
- A network flow measures a sequence of IP packets that share a common feature as they pass between devices.
- Flow formats: NetFlow (Cisco), J-Flow (Juniper), sFlow (HP), ...
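As a rough illustration of the definition above, the following sketch groups packets into flows. It assumes the "common feature" is the usual 5-tuple (source address, destination address, source port, destination port, protocol); the `Packet` type, the function name, and the sample addresses are all made up for the example and do not come from the slides.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet header used only for this sketch."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    size: int  # bytes


def aggregate_flows(packets):
    """Group packets by 5-tuple and count packets and bytes per flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Made-up sample traffic: two packets of one TCP flow and one UDP packet.
packets = [
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 1500),
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 500),
    Packet("10.0.0.3", "10.0.0.2", 40001, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
```

Real exporters also split flows on timeouts and TCP flags, which this sketch leaves out.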

What is nProbe?
- nProbe is an open-source tool that captures packets flowing on an Ethernet segment, computes flows, and exports them to the specified collectors.
- Features:
  - Keeps up with Gbit speeds on Ethernet networks, handling thousands of packets per second without packet sampling on commodity hardware.
  - Supports major operating systems, including Unix, Windows, and Mac OS X.
  - Designed for environments with limited resources.

Traffic Information Resolution

nfcapd
- nfcapd is the netflow capture daemon. It reads netflow data from the network and stores it in files.
- The output file is automatically rotated and renamed every n minutes, typically 5 min; e.g. nfcapd[...] contains the data from May 3rd [...]:00 onward.
- Usage: /usr/local/bin/nfcapd -p <portnum> -t 300 -l /home/zengshan/netflow/nfcapd_file/IHEP &
  -p portnum         Specifies the port number to listen on. Default port is [...].
  -t interval        Specifies the time interval in seconds to rotate files.
  -l base_directory  Specifies the base directory to store the output files.
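The rotation behaviour can be sketched as follows: every timestamp is mapped to the start of its rotation interval, and that start time names the file. This is only an illustration. The `nfcapd.YYYYMMDDhhmm` naming pattern is an assumption here, the helper names are invented, and the example uses UTC for simplicity.

```python
from datetime import datetime, timezone


def rotation_slot(ts: datetime, interval_s: int = 300) -> datetime:
    """Return the start of the rotation interval that contains ts."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_s, tz=ts.tzinfo)


def capture_file_name(ts: datetime, interval_s: int = 300) -> str:
    """Build a capture file name (assumed pattern: nfcapd.YYYYMMDDhhmm)."""
    return "nfcapd." + rotation_slot(ts, interval_s).strftime("%Y%m%d%H%M")


# A flow seen at 09:13:42 falls into the 5-minute file that starts at 09:10.
seen = datetime(2012, 10, 17, 9, 13, 42, tzinfo=timezone.utc)
name = capture_file_name(seen)
```

With `-t 300`, as in the usage line above, one file covers exactly five minutes of traffic, which matches the 5-minute granularity used later for the trend charts.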

nfdump
- nfdump reads the netflow data from the files stored by nfcapd and dumps it as text.

nfdump output text format

Tag    Description
%ts    Start time (first seen)
%te    End time (last seen)
%td    Duration
%pr    Protocol
%sa    Source address
%da    Destination address
%sap   Source address:port
%dap   Destination address:port
%sp    Source port
%dp    Destination port
%sas   Source AS
%das   Destination AS
%in    Input interface number
%out   Output interface number
%pkt   Packet count in this flow
%byt   Byte count in this flow
%fl    Number of flows
%flg   TCP flags
%tos   Type of service
%bps   Bits per second
%pps   Packets per second
%bpp   Bytes per packet
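A minimal sketch of how text dumped with these tags might be turned into records for later processing. It assumes nfdump is run with a custom output format passed as `-o "fmt:..."`, with `;` chosen as the separator and with byte and packet counts printed as plain integers. The field selection, the function name, and the sample line are invented for illustration.

```python
# Tags chosen for this example; any tags from the table could be used.
FIELDS = ["ts", "te", "pr", "sa", "da", "pkt", "byt"]

# Format string to hand to nfdump, e.g. nfdump -r <file> -o "<NFDUMP_FORMAT>".
NFDUMP_FORMAT = "fmt:" + ";".join("%" + f for f in FIELDS)


def parse_line(line: str, sep: str = ";") -> dict:
    """Split one dumped line into a record keyed by tag name."""
    values = [v.strip() for v in line.split(sep)]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["pkt"] = int(record["pkt"])
    record["byt"] = int(record["byt"])
    return record


# Made-up line in the assumed format (fields may be padded with spaces).
sample = ("2012-10-17 09:10:02.123;2012-10-17 09:10:05.456;TCP  ;"
          "     192.0.2.10;   198.51.100.7;      12;    9000")
record = parse_line(sample)
```

Lines that do not match the expected field count, such as summary lines at the end of a dump, raise `ValueError` and can be skipped by the caller.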


Traffic Information Storage

HDFS
- HDFS is short for Hadoop Distributed File System.
- HDFS provides high-throughput access to application data.
- Differences from other distributed file systems:
  - Highly fault-tolerant
  - Designed to be deployed on low-cost hardware
  - Portable across heterogeneous hardware and software platforms
- Applications that run on HDFS need streaming access to their data sets.
- Suitable for applications that have large data sets.
- Moving computation is cheaper than moving data.

HDFS master/slave architecture
- NameNode
  - Manages the namespace of the file system
  - Regulates access to files by clients
  - Determines the mapping of blocks to DataNodes
- DataNodes
  - Serve read and write requests from clients
  - Perform block creation, deletion, and replication on instruction from the NameNode

Data Replication in HDFS
- To ensure fault tolerance, HDFS replicates the blocks of each file. The number of replicas of a block can be specified by the application.
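The idea of a file being split into blocks, each stored on several DataNodes, can be illustrated with a toy model. This is a sketch under simplifying assumptions: replicas are placed round-robin, whereas a real HDFS NameNode takes racks and load into account, and the block size, node names, and function names are chosen only for the example.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int) -> list:
    """Return the size in MB of each block of a file."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, datanodes: list, replication: int = 3) -> dict:
    """Map each block index to the DataNodes holding one of its replicas.

    Toy round-robin placement, NOT the real HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 200 MB file with an assumed 64 MB block size on four hypothetical nodes.
blocks = split_into_blocks(200, 64)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
```

Because every block lives on three different nodes in this model, losing any single node still leaves at least two copies of each block.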

Traffic Information Analysis

Map/Reduce
- MapReduce is a programming model for processing large data sets.
- It is typically used for distributed computing on clusters of computers.
- It can take advantage of data locality, processing data on or near the storage assets to reduce data transmission.
- The model is inspired by the map and reduce functions commonly used in functional programming.
- "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.
- "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
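The two steps can be sketched in plain Python on a single machine. This is not Hadoop code and not the job used in the talk; it only mimics the map, shuffle, and reduce phases on a few made-up flow records to total the bytes sent per source address. The record keys `sa` and `byt` follow the nfdump tag names, and all function names are invented.

```python
from collections import defaultdict


def map_flow(record):
    """Map step: emit (source address, byte count) for one flow record."""
    yield record["sa"], record["byt"]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bytes(key, values):
    """Reduce step: combine all byte counts for one source address."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for rec in records for pair in map_flow(rec))
    return dict(reduce_bytes(k, vs) for k, vs in shuffle(pairs).items())


# Made-up flow records.
records = [
    {"sa": "192.0.2.10", "byt": 9000},
    {"sa": "192.0.2.11", "byt": 400},
    {"sa": "192.0.2.10", "byt": 1000},
]
totals = run_job(records)
```

On a cluster the map calls run in parallel next to the data blocks, and only the much smaller (key, value) pairs travel over the network to the reducers.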

Traffic Information Display

Drawing tools
- RRDtool
  - The name is an acronym for round-robin database tool.
  - The data are stored in a round-robin database (circular buffer).
  - It also includes tools to extract RRD data in a graphical format.
  - Used to draw flow trend graphs along three dimensions: flow count, packet count, and traffic count.
- Highstock
  - Highstock lets you create stock or general timeline charts in pure JavaScript, including sophisticated navigation options such as a small navigator series, preset date ranges, a date picker, scrolling, and panning.
  - We only need to write an API between HDFS and Highstock.
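One possible shape for the glue between the analysis results and the charting library is sketched below. It assumes the results are already available as (timestamp, value) pairs and converts them into a series of `[milliseconds since epoch, value]` points, the time-series data layout that Highstock accepts. The function name, the series name, and the sample values are illustrative only, and how the talk's actual API reads from HDFS is not shown in the slides.

```python
import json
from datetime import datetime, timezone


def to_series(name: str, points) -> dict:
    """Convert (datetime, value) pairs into a chart series dictionary."""
    ordered = sorted(points, key=lambda p: p[0])
    data = [[int(ts.timestamp() * 1000), value] for ts, value in ordered]
    return {"name": name, "data": data}


# Made-up 5-minute flow counts, deliberately given out of order.
points = [
    (datetime(2012, 10, 17, 0, 5, tzinfo=timezone.utc), 1310),
    (datetime(2012, 10, 17, 0, 0, tzinfo=timezone.utc), 1250),
]
series = to_series("Flow count", points)
payload = json.dumps([series])  # JSON body a web endpoint could return
```

Sorting by timestamp before serialising matters because timeline charts generally expect points in ascending time order.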

Architecture

Conclusion
- Network flow trend chart from IHEP every 5 minutes along three dimensions: flow count, packet count, and traffic count.
- Detailed traffic information chart:
  - Select a single timeslot and get the detailed traffic information.
  - Select a time window and get the detailed traffic information.
  - On hovering over the chart, a tooltip with traffic information for each point and series is displayed. The tooltip follows the mouse as the user moves over the graph.
- Traffic information related to the HEP experiments:
  - Available once the IP addresses of the machines involved in the experiment's data transfers are specified.
  - We already have DYB/YBJ/CMS/ATLAS.
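Attributing traffic to an experiment from a list of its machines' addresses could look roughly like the sketch below. The network ranges are reserved documentation prefixes standing in for the real, unpublished address lists, and the mapping, function names, and flow records are all hypothetical.

```python
import ipaddress

# Placeholder address ranges (documentation prefixes), NOT the real ones.
EXPERIMENT_NETWORKS = {
    "CMS": [ipaddress.ip_network("192.0.2.0/24")],
    "ATLAS": [ipaddress.ip_network("198.51.100.0/24")],
}


def experiment_of(address: str):
    """Return the experiment whose networks contain address, or None."""
    ip = ipaddress.ip_address(address)
    for name, networks in EXPERIMENT_NETWORKS.items():
        if any(ip in net for net in networks):
            return name
    return None


def bytes_per_experiment(flows) -> dict:
    """Sum flow bytes per experiment, matching on source then destination."""
    totals = {}
    for flow in flows:
        name = experiment_of(flow["sa"]) or experiment_of(flow["da"])
        if name is not None:
            totals[name] = totals.get(name, 0) + flow["byt"]
    return totals


# Made-up flow records using the nfdump tag names as keys.
flows = [
    {"sa": "192.0.2.10", "da": "203.0.113.5", "byt": 9000},
    {"sa": "203.0.113.5", "da": "198.51.100.7", "byt": 400},
    {"sa": "203.0.113.5", "da": "203.0.113.9", "byt": 50},
]
totals = bytes_per_experiment(flows)
```

Flows that match no configured network are simply left out of the per-experiment totals.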

Thank You
Questions?