Phoenix Liau Trend Micro Cloud Computing Era (Practice)


Three Major Trends to Change the World: Cloud Computing, Mobile, Big Data

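What is cloud computing? A business model that uses Internet technologies to deliver scalable and elastic IT capabilities to users as a service. The definition from the U.S. National Institute of Standards and Technology (NIST) covers: Essential Characteristics, Service Models, Deployment Models.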

It’s About the Ecosystem: Cloud computing (IaaS, PaaS, SaaS) generates big data (structured and semi-structured, alongside the enterprise data warehouse). Big data leads to business insights. Business insights create competition, innovation, and productivity.

What is Big Data? A set of files? A database? A single file?

What is the problem? Getting the data to the processors becomes the bottleneck. Quick calculation: typical disk data transfer rate: 75 MB/sec; time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
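The quick calculation above can be reproduced in a few lines. This is only a sketch: it assumes 1 GB = 1000 MB, and the 100-disk parallel case is a hypothetical comparison that does not appear on the slide.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float, mb_per_gb: int = 1000) -> float:
    """Minutes needed to stream size_gb from one disk at rate_mb_per_s."""
    seconds = size_gb * mb_per_gb / rate_mb_per_s
    return seconds / 60


# Figures from the slide: 100 GB at 75 MB/sec on a single disk.
single_disk_minutes = transfer_minutes(100, 75)

# Hypothetical comparison: the same 100 GB split evenly over 100 disks read in parallel,
# so each disk only has to deliver 1 GB.
parallel_seconds = transfer_minutes(100 / 100, 75) * 60

print(f"one disk: {single_disk_minutes:.1f} minutes")
print(f"100 disks in parallel: {parallel_seconds:.1f} seconds")
```

Spreading the read over many disks is the idea that motivates moving the computation to where the data is stored.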

The Era of Big Data: Are You Ready? Businesses are driving the growth of big data. Storing huge volumes of data capably, managing them efficiently, and turning them into business value are big challenges for enterprises. Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture, causing chain reactions in database storage, data mining, business intelligence, cloud computing, and computing applications. Data for commercial business analysis: 2011: multi-terabyte (TB); 2020: 35.2 ZB (1 ZB = 1 billion TB).

Who Needs It?
When to use Hadoop: affordable storage/compute; unstructured or semi-structured data; resilient auto scalability.
When to use an enterprise database: ad hoc reporting (<1 sec); multi-step transactions; lots of inserts/updates/deletes.

Hadoop!

Apache Hadoop project: inspired by Google's MapReduce and Google File System papers. An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware. Open-source software + commodity hardware = IT cost reduction.

©2011 Cloudera, Inc. All Rights Reserved. Hadoop Core: HDFS, MapReduce

HDFS (Hadoop Distributed File System): redundancy; fault tolerant; scalable; self healing; write once, read many times; Java API; command-line tool.

MapReduce: two phases of functional programming; redundancy; fault tolerant; scalable; self healing; Java API.

Hadoop Core: HDFS, MapReduce, Java

Word Count Example
Map input: key = offset, value = line
0: The cat sat on the mat
22: The aardvark sat on the sofa
Map output: key = word, value = count
Reduce output: key = word, value = sum of counts
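The key/value flow of the word count example can be imitated in plain Python. This is a single-process sketch of the idea, not Hadoop code: the function names are made up for illustration, and lowercasing the words is an added assumption so that "The" and "the" are counted together.

```python
from collections import defaultdict


def map_phase(offset, line):
    """Map: (offset, line) -> list of (word, 1) pairs. The offset is ignored."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, sum of counts)."""
    return word, sum(counts)


# The two input records shown on the slide, as (offset, line) pairs.
records = [(0, "The cat sat on the mat"), (22, "The aardvark sat on the sofa")]

mapped = [pair for offset, line in records for pair in map_phase(offset, line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_counts)
```

In a real cluster the map calls run on many machines at once and the shuffle moves data over the network; only the shape of the computation is the same here.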

The Hadoop Ecosystem

The Ecosystem is the System. Hadoop has become the kernel of the distributed operating system for Big Data. No one uses the kernel alone. It is a collection of projects at Apache.

Relation Map: MapReduce Runtime (distributed programming framework); Hadoop Distributed File System (HDFS); ZooKeeper (coordination); HBase (column NoSQL DB); Sqoop/Flume (data integration); Oozie (job workflow and scheduling); Pig/Hive (analytical language); Hue (web console); Mahout (data mining).

ZooKeeper – Coordination Framework

What is ZooKeeper? A centralized service for maintaining configuration information and providing distributed synchronization. A set of tools to build distributed applications that can safely handle partial failures. ZooKeeper was designed to store coordination data: status information, configuration, location information.

Flume / Sqoop – Data Integration Framework

What’s the problem with data collection? Data collection is currently a priori and ad hoc. A priori: you decide what you want to collect ahead of time. Ad hoc: each kind of data source goes through its own collection path.

Flume (and how can it help?) A distributed data collection service. It efficiently collects, aggregates, and moves large amounts of data. Fault tolerant, with many failover and recovery mechanisms. A one-stop solution for data collection in all formats.

Flume: High-Level Overview. Logical node: source, sink.
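The source/sink pairing of a logical node can be pictured with a toy pipeline. This sketch is not the Flume API: `line_source`, `ListSink`, and `run_logical_node` are hypothetical names, and the in-memory list merely stands in for a real destination such as HDFS.

```python
from typing import Iterable, Iterator, List


def line_source(lines: Iterable[str]) -> Iterator[str]:
    """Toy source: yields one event per non-empty log line."""
    for line in lines:
        event = line.rstrip("\n")
        if event:
            yield event


class ListSink:
    """Toy sink: collects events in memory instead of writing to HDFS."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def write(self, event: str) -> None:
        self.events.append(event)


def run_logical_node(source: Iterator[str], sink: ListSink) -> int:
    """A logical node pulls every event from its source and pushes it to its sink."""
    moved = 0
    for event in source:
        sink.write(event)
        moved += 1
    return moved


sink = ListSink()
moved = run_logical_node(line_source(["GET /a\n", "\n", "GET /b\n"]), sink)
print(moved, sink.events)
```

Because a node only needs something to read from and something to write to, nodes can be chained so that one node's sink feeds the next node's source.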

Flume Architecture: Log -> Flume Node; Log -> Flume Node; ... -> HDFS

Flume Sources and Sinks: local files; HDFS; stdin, stdout; Twitter; IRC; IMAP.

Sqoop: Easy, parallel database import/export. What do you want to do? Import data from an RDBMS into HDFS; export data from HDFS back into an RDBMS.

Sqoop: RDBMS <-> Sqoop <-> HDFS

Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City...
$ hadoop fs -cat City/part-m
,Kabul,AFG,Kabol,
,Qandahar,AFG,Qandahar,
,Herat,AFG,Herat,
,Mazar-e-Sharif,AFG,Balkh,
,Amsterdam,NLD,Noord-Holland,

Pig / Hive – Analytical Language

Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master. Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code. Many organizations have programmers who are skilled at writing code in scripting languages. Hive and Pig are two projects that evolved separately to help such people analyze huge amounts of data via MapReduce. Hive was initially developed at Facebook, Pig at Yahoo!

Hive (developed by Facebook)
What is Hive? An SQL-like interface to Hadoop. A data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop: MapReduce for execution, HDFS for storage.
Hive Query Language: basic SQL (Select, From, Join, Group-By); Equi-Join, Multi-Table Insert, Multi-Group-By; batch query.
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;
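The shape of the sample query (filter rows with WHERE, then aggregate per storeid with GROUP BY) can be tried locally without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made for convenience. The table contents and the extra `item` column are invented, and the query counts matching purchases per store because a grouped query needs an aggregate in its select list.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (storeid INTEGER, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "tv", 450.0),
        (1, "cable", 12.0),
        (2, "laptop", 900.0),
        (2, "phone", 600.0),
        (3, "pen", 2.0),
    ],
)

# Keep purchases over 100, then count them per store.
rows = conn.execute(
    "SELECT storeid, COUNT(*) FROM purchases "
    "WHERE price > 100 GROUP BY storeid ORDER BY storeid"
).fetchall()
conn.close()

print(rows)
```

On Hive the same statement would be compiled into MapReduce jobs, with the WHERE filter applied on the map side and the per-store count computed on the reduce side.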

Hive: SQL -> Hive -> MapReduce

Pig (initiated by Yahoo!)
A high-level scripting language (Pig Latin). Processes data one step at a time. Makes MapReduce programs simple to write, easy to understand, and easy to debug.
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
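What the JOIN step computes can be shown on two tiny in-memory relations. This is a plain-Python sketch of an inner equi-join on the first field, not Pig itself. The rows are invented, and `join_by_id` is a hypothetical helper name.

```python
from collections import defaultdict


def join_by_id(a_rows, b_rows):
    """Inner equi-join on the first field of each tuple.

    Each output row is the A tuple followed by the matching B tuple,
    so the id appears twice, as it does in a Pig join result.
    """
    b_index = defaultdict(list)
    for row in b_rows:
        b_index[row[0]].append(row)
    return [a + b for a in a_rows for b in b_index.get(a[0], [])]


# Made-up rows: A is (id, name, age), B is (id, address).
A = [(1, "Ann", 31), (2, "Bob", 45), (3, "Cy", 28)]
B = [(1, "Taipei"), (3, "Tokyo"), (4, "Seoul")]

C = join_by_id(A, B)
print(C)
```

Rows without a partner on the other side (id 2 in A, id 4 in B) drop out, which is the behavior of an inner join.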

Pig: Pig script -> Pig -> MapReduce

Hive vs. Pig

WordCount Example
Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
For the given sample input, the map emits a (word, 1) pair for each word, and the reduce just sums up the values.

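WordCount Example in MapReduce

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}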

WordCount Example by Pig
A = LOAD 'wordcount/input' USING PigStorage AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;

WordCount Example by Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

The Story So Far. Components: RDBMS, Sqoop, Flume, Hive, Pig, MapReduce, HDFS. Interfaces: SQL (Hive), script (Pig), Java (MapReduce), POSIX (FS).

HBase – Column NoSQL DB

Structured-data vs Raw-data

HBase: inspired by Google's Bigtable. Coordinated by ZooKeeper. Low-latency random reads and writes. Distributed key/value store. Simple API: PUT, GET, DELETE, SCAN.

HBase – Data Model: Cells are “versioned”. Table rows are sorted by row key. Region: a row range [start-key:end-key].
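The three properties of the data model (versioned cells, rows sorted by row key, regions as row ranges) can be imitated with a small in-memory class. This is a teaching sketch only: `ToyTable`, its methods, and the explicit integer timestamps are hypothetical and unrelated to the real HBase client API. It also assumes that a region includes its start key and excludes its end key.

```python
import bisect


class ToyTable:
    """In-memory imitation of versioned cells, sorted rows, and regions."""

    def __init__(self, split_keys):
        # Split keys divide the row-key space into regions [start-key:end-key).
        self.split_keys = sorted(split_keys)
        self.rows = {}  # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, ts):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest version first

    def get(self, row, column, max_versions=1):
        return self.rows.get(row, {}).get(column, [])[:max_versions]

    def scan(self, start=None, stop=None):
        """Yield row keys in sorted order within [start, stop)."""
        for row in sorted(self.rows):
            if (start is None or row >= start) and (stop is None or row < stop):
                yield row

    def region_of(self, row):
        """Index of the region whose row range contains this row key."""
        return bisect.bisect_right(self.split_keys, row)


table = ToyTable(split_keys=["m"])          # two regions: keys < "m" and keys >= "m"
table.put("row2", "mycf:col1", "val3", ts=1)
table.put("row1", "mycf:col1", "val1", ts=1)
table.put("row1", "mycf:col1", "val1-new", ts=2)

latest = table.get("row1", "mycf:col1")
history = table.get("row1", "mycf:col1", max_versions=2)
scanned = list(table.scan())
print(latest, history, scanned, table.region_of("apple"), table.region_of("row1"))
```

Keeping rows sorted is what makes range scans cheap, and it is also why a region can be described by nothing more than a start key and an end key.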

HBase – Workflow

HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

Oozie – Job Workflow & Scheduling

What is Oozie? A Java web application. Oozie is a workflow scheduler for Hadoop ("crond for Hadoop"). Workflows are triggered by time or by data. Example workflow: Job 1, Job 2, Job 3, Job 4, Job 5.
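The idea of a workflow scheduler, running each job only after the jobs it depends on have finished, can be sketched with Python's standard graphlib module (Python 3.9 or later). This is not Oozie's workflow format. The dependency edges between Job 1 to Job 5 are hypothetical, since the transcript does not preserve how the jobs are connected, and `run_workflow` is an illustrative name.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs that must finish first.
workflow = {
    "Job 1": set(),
    "Job 2": {"Job 1"},
    "Job 3": {"Job 1"},
    "Job 4": {"Job 2", "Job 3"},
    "Job 5": {"Job 4"},
}


def run_workflow(dependencies, run_job):
    """Run jobs in dependency order; return the waves of jobs that became ready together."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose prerequisites are all done
        for job in ready:
            run_job(job)
            sorter.done(job)
        waves.append(ready)
    return waves


executed = []
waves = run_workflow(workflow, executed.append)
print(waves)
```

Jobs in the same wave have no dependency on each other, so a real scheduler could submit them to the cluster concurrently.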

Oozie Features: component independent (MapReduce, Hive, Pig, Sqoop, Streaming).

Mahout – Data Mining

What is Mahout? A machine-learning tool. Distributed and scalable machine-learning algorithms on the Hadoop platform. Makes building intelligent applications easier and faster.

Mahout Use Cases: Yahoo: spam detection; Foursquare: recommendations; SpeedDate.com: recommendations; Adobe: user targeting; Amazon: personalization platform.

Use Case Example: Predict what a user likes based on the user's own historical behavior and the aggregate behavior of similar people.
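One common way to combine a user's own history with the behavior of similar people is user-based collaborative filtering. The sketch below is a minimal single-machine version in plain Python, not Mahout's API. The ratings are invented, and choosing cosine similarity as the measure of "similar people" is an assumption.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two {item: rating} dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_n=1):
    """Score items the target has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(ratings[target], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up ratings: alice's history resembles bob's far more than carol's.
ratings = {
    "alice": {"hadoop_book": 5, "hive_book": 4},
    "bob": {"hadoop_book": 5, "hive_book": 5, "pig_book": 4},
    "carol": {"cooking_book": 5, "hive_book": 1},
}

top_pick = recommend("alice", ratings)
top_two = recommend("alice", ratings, top_n=2)
print(top_pick, top_two)
```

A distributed implementation would spread the similarity computation across the cluster, but the scoring idea is the same.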

Conclusion. Today, we introduced: why Hadoop is needed; the basic concepts of HDFS and MapReduce; what sorts of problems can be solved with Hadoop; what other projects are included in the Hadoop ecosystem.

Recap – Hadoop Ecosystem: MapReduce Runtime (distributed programming framework); Hadoop Distributed File System (HDFS); ZooKeeper (coordination); HBase (column NoSQL DB); Sqoop/Flume (data integration); Oozie (job workflow and scheduling); Pig/Hive (analytical language); Hue (web console); Mahout (data mining).

Trend Micro Cloud Antivirus Case Study

Collaboration in the underground

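Explosive growth of Internet threats: all kinds of malware variants, spam, unknown download sources, and other threats from the Internet evade detection by traditional security systems and keep growing explosively, posing a serious information security threat. New unique malware discovered: 1M unique malware samples every month.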

New Design Concept for Threat Intelligence. Sources: web crawler; Trend Micro Endpoint Protection; Trend Micro Mail Protection; Trend Micro Web Protection; honeypot; CDN / xSP; human intelligence. 150M+ worldwide endpoints/sensors.

Challenges We Face: The concept is great, but ... 6 TB of data and 15B lines of logs are received daily. It becomes the Big Data challenge!
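A little arithmetic puts the daily figures in perspective. The sketch below only divides the two numbers on the slide. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even arrival rate over the day, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = 6 * 10**12    # 6 TB, assuming decimal terabytes
lines_per_day = 15 * 10**9    # 15B log lines

avg_bytes_per_line = bytes_per_day / lines_per_day
lines_per_second = lines_per_day / SECONDS_PER_DAY
mb_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"average line size: {avg_bytes_per_line:.0f} bytes")
print(f"sustained rate: {lines_per_second:,.0f} lines/s, {mb_per_second:.1f} MB/s")
```

Under these assumptions the average sustained ingest rate alone is close to what a single disk can stream, before any processing or replication is counted.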

Issues to Address: Raw data -> information -> threat intelligence/solution. Volume: infinite. Time: no delay. Target: ever-changing threats.

SPN High-Level Architecture. Components shown: feedback information (CDN log, spam, HTTP POST, web pages, HTTP download); SPN feedback; L4; log receivers; message bus; log post-processing; HBase; MapReduce; Hadoop Distributed File System (HDFS); Lumber Jack; ad hoc query (Pig); Circus (Ambari); SPN infrastructure; applications; L4 reputation service; Web Reputation Service; File Reputation Service; Global Object Cache (GOC); Tracking Logging System (TLS); malware classification; correlation platform.

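Trend Micro Big Data Processing Capacity. Data the cloud antivirus service must handle every day: 8.5 billion Web Reputation queries; 3 billion Reputation queries; 7 billion File Reputation queries; 6 TB of raw logs collected from around the world; connections from 150 million endpoint devices.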

Trend Micro: Web Reputation Services
Process: user traffic | honeypot; Akamai; rating server for known threats; unknown & prefilter; page download; threat analysis.
Volumes: 8 billion/day; 40% filtered; 4.8 billion/day; 82% filtered; 860 million/day; 99.98% filtered; 25,000 malicious URLs/day.
Technology: Trend Micro products / technology; CDN cache; high-throughput web service; Hadoop cluster; web crawling; machine learning; data mining.
Operation: block a malicious URL within 15 minutes of it going online!
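The first two volume figures can be checked with a short calculation. This sketch reads the 40% and 82% figures as successive filters applied to the daily query volume, which is an interpretation of the flattened diagram rather than something the slide states outright. `apply_filter` is an illustrative helper, and the final 99.98% stage is left out because the transcript does not make clear what base it applies to.

```python
def apply_filter(count: int, percent_filtered: int) -> int:
    """Number of items that remain after percent_filtered percent are filtered out."""
    return count * (100 - percent_filtered) // 100


daily_queries = 8_000_000_000

after_first_stage = apply_filter(daily_queries, 40)       # compare with 4.8 billion/day
after_second_stage = apply_filter(after_first_stage, 82)  # compare with 860 million/day

print(f"{after_first_stage:,} remain after the first stage")
print(f"{after_second_stage:,} remain after the second stage")
```

Under this reading the results line up with the 4.8 billion and roughly 860 million per day quoted on the slide.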

Big Data Cases

Line Data on HBase. Line data: MODEL: -> ; INDEX: -> ; User: -> . Consistency in HBase. Contact model: use column qualifier to store. Supports range queries (e.g. message box).

Pig at LinkedIn

LinkedIn - Pig Example
views = LOAD '/data/awesome' USING VoldemortStorage();
views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');

Facebook Messages

Facebook Open Source Stack
Memcached --> App Server Cache
ZooKeeper --> Small Data Coordination Service
HBase --> Database Storage Engine
HDFS --> Distributed FileSystem
Hadoop --> Asynchronous Map-Reduce Jobs

Questions?

Thank you!