Phoenix Liau Trend Micro Cloud Computing Era (Practice)

Three Major Trends to Chang the World Cloud Computing Mobile Big Data

什麼是雲端運算？以服務 (as-a-service) 的商業模式，透過 Internet 技術，提供具有擴充性 (scalable) 和彈性 (elastic) 的 IT 相關功能給使用者 Essential Characteristics Service Models Deployment Models 美國國家標準技術研究所 (NIST) 的定義 :

It’s About the Ecosystem IaaS PaaS SaaS Cloud Computing Generate Big Data Lead Business Insights create Competition, Innovation, Productivity Structured, Semi-structured Enterprise Data Warehouse

What is BigData? A set of filesA databaseA single file

What is the problem Getting the data to the processors becomes the bottleneck Quick calculation –Typical disk data transfer rate: 75MB/sec –Time taken to transfer 100GB of data to the processor: approx. 22 minutes!

The Era of Big Data – Are You Ready Businesses are driving the growth of big data. The capable data storage, efficient management, and capturing values to business values of huge size of data are enterprise big challenges. Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture which will cause chain reactions in database storage, data mining, business intelligence, cloud computing, and computing application. Data for business commercial analysis 2011: multi-terabyte (TB) 2020: 35.2 ZB (1 ZB = 1 billion TB)

Who Needs It? When to use? Affordable Storage/Compute Unstructured or Semi-structured Resilient Auto Scalability When to use? Ad-hoc Reporting (<1sec) Multi-step Transactions Lots of Inserts/Updates/Deletes Enterprise Database Hadoop

Hadoop!

– inspired by Apache Hadoop project –inspired by Google's MapReduce and Google File System papers. Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware Open Source Software + Hardware Commodity – IT Costs Reduction

Word Count Example Key: offset Value: line Key: word Value: count Key: word Value: sum of count 0:The cat sat on the mat 22:The aardvark sat on the sofa

The Hadoop Ecosystems

The Ecosystem is the System Hadoop has become the kernel of the distributed operating system for Big Data No one uses the kernel alone A collection of projects at Apache

Relation Map MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

Zookeeper – Coordination Framework MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

What is ZooKeeper A centralized service for maintaining –Configuration information –Providing distributed synchronization A set of tools to build distributed applications that can safely handle partial failures ZooKeeper was designed to store coordination data –Status information –Configuration –Location information

Flume / Sqoop – Data Integration Framework MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

What’s the problem for data collection Data collection is currently a priori and ad hoc A priori – decide what you want to collect ahead of time Ad hoc – each kind of data source goes through its own collection path

(and how can it help?) A distributed data collection service It efficiently collecting, aggregating, and moving large amounts of data Fault tolerant, many failover and recovery mechanism One-stop solution for data collection of all formats

Flume: High-Level Overview Logical Node Source Sink

Sqoop Easy, parallel database import/export What you want do? –Insert data from RDBMS to HDFS –Export data from HDFS back into RDBMS

©2011 Cloudera, Inc. All Rights Reserved. Sqoop Examples 29 $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City... $ hadoop fs -cat City/part-m-00000 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,Herat,AFG,H erat,1868004,Mazar-e-Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord- Holland,731200...

Pig / Hive – Analytical Language MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code Many organizations have programmers who are skilled at writing code in scripting languages Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce –Hive was initially developed at Facebook, Pig at Yahoo!

Hive – Developed by What is Hive? –An SQL-like interface to Hadoop Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop –MapRuduce for execution –HDFS for storage Hive Query Language –Basic-SQL : Select, From, Join, Group-By –Equi-Join, Muti-Table Insert, Multi-Group-By –Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid

Pig A high-level scripting language (Pig Latin) Process data one step at a time Simple to write MapReduce program Easy understand Easy debug A = load ‘a.txt’ as (id, name, age,...) B = load ‘b.txt’ as (id, address,...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’ A = load ‘a.txt’ as (id, name, age,...) B = load ‘b.txt’ as (id, address,...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’ – Initiated by

Hive vs. Pig

Input For the given sample input the map emits the reduce just sums up the values Hello World Bye World Hello Hadoop Goodbye Hadoop Hello World Bye World Hello Hadoop Goodbye Hadoop WordCount Example

WordCount Example In MapReduce public class WordCount { public static class Map extends Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } public static class Reduce extends Reducer { public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }

WordCount Example By Pig A = LOAD 'wordcount/input' USING PigStorage as (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) as count; DUMP C;

WordCount Example By Hive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH ’wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT count(*) FROM wordcount GROUP BY token;

Hbase – Column NoSQL DB MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

Structured-data vs Raw-data

I – Inspired by Coordinated by Zookeeper Low Latency Random Reads And Writes Distributed Key/Value Store Simple API –PUT –GET –DELETE –SCANE

Hbase – Data Model Cells are “versioned” Table rows are sorted by row key Region – a row range [start-key:end-key]

Hbase – workflow

©2011 Cloudera, Inc. All Rights Reserved. HBase Examples hbase> create 'mytable', 'mycf‘ hbase> list hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘ hbase> put 'mytable', 'row1', 'mycf:col2', 'val2‘ hbase> put 'mytable', 'row2', 'mycf:col1', 'val3‘ hbase> scan 'mytable‘ hbase> disable 'mytable‘ hbase> drop 'mytable'

Oozie – Job Workflow & Scheduling MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

What is ? A Java Web Application Oozie is a workﬂow scheduler for Hadoop Crond for Hadoop Triggered –Time –Data Job 1 Job 3 Job 2 Job 4Job 5

Mahout – Data Mining MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

What is Machine-learning tool Distributed and scalable machine learning algorithms on the Hadoop platform Building intelligent applications easier and faster

Use case Example Predict what the user likes based on –His/Her historical behavior –Aggregate behavior of people similar to him

Conclusion Today, we introduced: Why Hadoop is needed The basic concepts of HDFS and MapReduce What sort of problems can be solved with Hadoop What other projects are included in the Hadoop ecosystem

Recap – Hadoop Ecosystem MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)

趨勢科技雲端防毒 Case Study

Collaboration in the underground

網路威脅呈現爆炸性的成長各式各樣的變種病毒、垃圾郵件、不明的下載來源等等，這些來自網路上的威脅，躲過傳統安全防護系統的偵測，一直持續呈現爆炸性的成長，形成嚴重的資安威脅 New Unique Malware Discovered 1M unique Malwares every month 1M unique Malwares every month

New Design Concept for Threat Intelligence Web Crawler Trend Micro Endpoint Protection Trend Micro Endpoint Protection Trend Micro Mail Protection Trend Micro Mail Protection Trend Micro Web Protection Trend Micro Web Protection Honeypot CDN / xSP Human Intelligence 150M+ Worldwide Endpoints/Sensors

Challenges We Are Faced The Concept is Great but …. 6TB of data and 15B lines of logs received daily by It becomes the Big Data Challenge!

Raw Data Information Threat Intelligence/Solution Volume: Infinite Time: No Delay Target: Keep Changing Threats Issues to Address

SPN Feedback Log Receiver L4 Message Bus Log Receiver HBase MapReduce Hadoop Distributed File System (HDFS) Hadoop Distributed File System (HDFS) CDN Log SPAM HTTP POST Web Pages HTTP Download Feedback Information SPN High Level Architecture Log Post Processing L4 Email Reputation Service SPN infrastructure Application Global Object Cache (GOC) Global Object Cache (GOC) Tracking Logging System (TLS) Tracking Logging System (TLS) Malware Classificati on Correlation Platform Web Reputation Service Web Reputation Service File Reputation Service File Reputation Service Lumber Jack Lumber Jack Adhoc-Query (Pig) Circus (Ambari) Circus (Ambari) Log Post Processing

Trend Micro Big Data process capacity 雲端防毒每日需要處理的資料量 85 億個 Web Reputation 查詢 30 億個 Email Reputation 查詢 70 億個 File Reputation 查詢處理 6 TB 從全世界收集到的 raw logs 來自 1.5 億台終端裝置的連線

Trend Micro: Web Reputation Services User Traffic | Honeypot Akamai Rating Server for Known Threats Unknown & Prefilter Page Download Threat Analysis 8 billions/day 4.8 billions/day 860 millions/day 40% filtered 82% filtered 25,000 malicious URL /day 99.98% filtered Trend Micro Products / Technology CDN Cache High Throughput Web Service Hadoop Cluster Web Crawling Machine Learning Data Mining TechnologyProcessOperation Block malicious URL within 15 minutes once it goes online! 15 Minutes

Big Data Cases

Line Data on HBase Line data –MODEL: -> –INDEX: -> User: ->, Consistency in HBase Contact model: use column qualifier to store Support range query (e.g. message box)

Pig at Linkedin

Linkedin - Pig Example views = LOAD '/data/awesome' USING VoldemortStorage(); views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1’)

Facebook Messages

Facebook Open Source Stack Memcached --> App Server Cache ▪ZooKeeper --> Small Data Coordination Service ▪HBase --> Database Storage Engine ▪HDFS --> Distributed FileSystem ▪Hadoop --> Asynchronous Map-Reduce Jobs

Questions?

Thank you!

Phoenix Liau Trend Micro Cloud Computing Era (Practice)

Similar presentations

Presentation on theme: "Phoenix Liau Trend Micro Cloud Computing Era (Practice)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Phoenix Liau Trend Micro Cloud Computing Era (Practice)

Similar presentations

Presentation on theme: "Phoenix Liau Trend Micro Cloud Computing Era (Practice)"— Presentation transcript:

Similar presentations

About project

Feedback