Download presentation
Presentation is loading. Please wait.
Published byAnthony White Modified over 9 years ago
1
Phoenix Liau Trend Micro Cloud Computing Era (Practice)
2
Three Major Trends to Chang the World Cloud Computing Mobile Big Data
3
什麼是雲端運算? 以服務 (as-a-service) 的商業模式,透過 Internet 技術,提供具有擴充性 (scalable) 和彈性 (elastic) 的 IT 相關功能給使用者 Essential Characteristics Service Models Deployment Models 美國國家標準技術研究所 (NIST) 的定義 :
4
It’s About the Ecosystem IaaS PaaS SaaS Cloud Computing Generate Big Data Lead Business Insights create Competition, Innovation, Productivity Structured, Semi-structured Enterprise Data Warehouse
5
What is BigData? A set of filesA databaseA single file
6
What is the problem Getting the data to the processors becomes the bottleneck Quick calculation –Typical disk data transfer rate: 75MB/sec –Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
7
The Era of Big Data – Are You Ready Businesses are driving the growth of big data. The capable data storage, efficient management, and capturing values to business values of huge size of data are enterprise big challenges. Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture which will cause chain reactions in database storage, data mining, business intelligence, cloud computing, and computing application. Data for business commercial analysis 2011: multi-terabyte (TB) 2020: 35.2 ZB (1 ZB = 1 billion TB)
8
Who Needs It? When to use? Affordable Storage/Compute Unstructured or Semi-structured Resilient Auto Scalability When to use? Ad-hoc Reporting (<1sec) Multi-step Transactions Lots of Inserts/Updates/Deletes Enterprise Database Hadoop
9
Hadoop!
10
– inspired by Apache Hadoop project –inspired by Google's MapReduce and Google File System papers. Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware Open Source Software + Hardware Commodity – IT Costs Reduction
11
©2011 Cloudera, Inc. All Rights Reserved. Hadoop Core HDFS MapReduce
12
©2011 Cloudera, Inc. All Rights Reserved. HDFS Hadoop Distributed File System Redundancy Fault Tolerant Scalable Self Healing Write Once, Read Many Times Java API Command Line Tool
13
©2011 Cloudera, Inc. All Rights Reserved. MapReduce 13 Two Phases of Functional Programming Redundancy Fault Tolerant Scalable Self Healing Java API
14
©2011 Cloudera, Inc. All Rights Reserved. Hadoop Core 14 HDFS MapReduce Java
15
Word Count Example Key: offset Value: line Key: word Value: count Key: word Value: sum of count 0:The cat sat on the mat 22:The aardvark sat on the sofa
16
The Hadoop Ecosystems
17
The Ecosystem is the System Hadoop has become the kernel of the distributed operating system for Big Data No one uses the kernel alone A collection of projects at Apache
18
Relation Map MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
19
Zookeeper – Coordination Framework MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
20
What is ZooKeeper A centralized service for maintaining –Configuration information –Providing distributed synchronization A set of tools to build distributed applications that can safely handle partial failures ZooKeeper was designed to store coordination data –Status information –Configuration –Location information
21
Flume / Sqoop – Data Integration Framework MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
22
What’s the problem for data collection Data collection is currently a priori and ad hoc A priori – decide what you want to collect ahead of time Ad hoc – each kind of data source goes through its own collection path
23
(and how can it help?) A distributed data collection service It efficiently collecting, aggregating, and moving large amounts of data Fault tolerant, many failover and recovery mechanism One-stop solution for data collection of all formats
24
Flume: High-Level Overview Logical Node Source Sink
25
©2011 Cloudera, Inc. All Rights Reserved. Flume Architecture Log Flume Node Log Flume Node... HDFS
26
©2011 Cloudera, Inc. All Rights Reserved. Flume Sources and Sinks Local Files HDFS Stdin, Stdout Twitter IRC IMAP
27
Sqoop Easy, parallel database import/export What you want do? –Insert data from RDBMS to HDFS –Export data from HDFS back into RDBMS
28
©2011 Cloudera, Inc. All Rights Reserved. Sqoop 28 RDBMS Sqoop HDFS
29
©2011 Cloudera, Inc. All Rights Reserved. Sqoop Examples 29 $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City... $ hadoop fs -cat City/part-m-00000 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,Herat,AFG,H erat,1868004,Mazar-e-Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord- Holland,731200...
30
Pig / Hive – Analytical Language MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
31
Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code Many organizations have programmers who are skilled at writing code in scripting languages Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce –Hive was initially developed at Facebook, Pig at Yahoo!
32
Hive – Developed by What is Hive? –An SQL-like interface to Hadoop Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop –MapRuduce for execution –HDFS for storage Hive Query Language –Basic-SQL : Select, From, Join, Group-By –Equi-Join, Muti-Table Insert, Multi-Group-By –Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
33
©2011 Cloudera, Inc. All Rights Reserved. Hive 33 MapReduce Hive SQL
34
Pig A high-level scripting language (Pig Latin) Process data one step at a time Simple to write MapReduce program Easy understand Easy debug A = load ‘a.txt’ as (id, name, age,...) B = load ‘b.txt’ as (id, address,...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’ A = load ‘a.txt’ as (id, name, age,...) B = load ‘b.txt’ as (id, address,...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’ – Initiated by
35
©2011 Cloudera, Inc. All Rights Reserved. Pig MapReduce Pig Script
36
Hive vs. Pig
37
Input For the given sample input the map emits the reduce just sums up the values Hello World Bye World Hello Hadoop Goodbye Hadoop Hello World Bye World Hello Hadoop Goodbye Hadoop WordCount Example
38
WordCount Example In MapReduce public class WordCount { public static class Map extends Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } public static class Reduce extends Reducer { public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
39
WordCount Example By Pig A = LOAD 'wordcount/input' USING PigStorage as (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) as count; DUMP C;
40
WordCount Example By Hive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH ’wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT count(*) FROM wordcount GROUP BY token;
41
41 ©2011 Cloudera, Inc. All Rights Reserved. The Story So Far RDBMS HivePig Sqoop MapReduce HDFS FS SQL Script Posix Java Flume
42
Hbase – Column NoSQL DB MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
43
Structured-data vs Raw-data
44
I – Inspired by Coordinated by Zookeeper Low Latency Random Reads And Writes Distributed Key/Value Store Simple API –PUT –GET –DELETE –SCANE
45
Hbase – Data Model Cells are “versioned” Table rows are sorted by row key Region – a row range [start-key:end-key]
46
Hbase – workflow
47
©2011 Cloudera, Inc. All Rights Reserved. HBase Examples hbase> create 'mytable', 'mycf‘ hbase> list hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘ hbase> put 'mytable', 'row1', 'mycf:col2', 'val2‘ hbase> put 'mytable', 'row2', 'mycf:col1', 'val3‘ hbase> scan 'mytable‘ hbase> disable 'mytable‘ hbase> drop 'mytable'
48
Oozie – Job Workflow & Scheduling MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
49
What is ? A Java Web Application Oozie is a workflow scheduler for Hadoop Crond for Hadoop Triggered –Time –Data Job 1 Job 3 Job 2 Job 4Job 5
50
©2011 Cloudera, Inc. All Rights Reserved. Oozie Features Component Independent –MapReduce –Hive –Pig –SqoopStreaming
51
Mahout – Data Mining MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
52
What is Machine-learning tool Distributed and scalable machine learning algorithms on the Hadoop platform Building intelligent applications easier and faster
53
©2011 Cloudera, Inc. All Rights Reserved. Mahout Use Cases Yahoo: Spam Detection Foursquare: Recommendations SpeedDate.com: Recommendations Adobe: User Targetting Amazon: Personalization Platform
54
Use case Example Predict what the user likes based on –His/Her historical behavior –Aggregate behavior of people similar to him
55
Conclusion Today, we introduced: Why Hadoop is needed The basic concepts of HDFS and MapReduce What sort of problems can be solved with Hadoop What other projects are included in the Hadoop ecosystem
56
Recap – Hadoop Ecosystem MapReduce Runtime (Dist. Programming Framework) Hadoop Distributed File System (HDFS) Zookeeper (Coordination) Hbase (Column NoSQL DB) Sqoop/Flume (Data integration) Oozie (Job Workflow & Scheduling) Pig/Hive (Analytical Language) Hue (Web Console) Mahout (Data Mining)
57
趨勢科技雲端防毒 Case Study
58
Collaboration in the underground
59
網路威脅呈現爆炸性的成長 各式各樣的變種病毒、垃圾郵件、不明的下載來源等等,這些來自網路上 的威脅,躲過傳統安全防護系統的偵測,一直持續呈現爆炸性的成長,形 成嚴重的資安威脅 New Unique Malware Discovered 1M unique Malwares every month 1M unique Malwares every month
61
New Design Concept for Threat Intelligence Web Crawler Trend Micro Endpoint Protection Trend Micro Endpoint Protection Trend Micro Mail Protection Trend Micro Mail Protection Trend Micro Web Protection Trend Micro Web Protection Honeypot CDN / xSP Human Intelligence 150M+ Worldwide Endpoints/Sensors
62
Challenges We Are Faced The Concept is Great but …. 6TB of data and 15B lines of logs received daily by It becomes the Big Data Challenge!
63
Raw Data Information Threat Intelligence/Solution Volume: Infinite Time: No Delay Target: Keep Changing Threats Issues to Address
64
SPN Feedback Log Receiver L4 Message Bus Log Receiver HBase MapReduce Hadoop Distributed File System (HDFS) Hadoop Distributed File System (HDFS) CDN Log SPAM HTTP POST Web Pages HTTP Download Feedback Information SPN High Level Architecture Log Post Processing L4 Email Reputation Service SPN infrastructure Application Global Object Cache (GOC) Global Object Cache (GOC) Tracking Logging System (TLS) Tracking Logging System (TLS) Malware Classificati on Correlation Platform Web Reputation Service Web Reputation Service File Reputation Service File Reputation Service Lumber Jack Lumber Jack Adhoc-Query (Pig) Circus (Ambari) Circus (Ambari) Log Post Processing
65
Trend Micro Big Data process capacity 雲端防毒每日需要處理的資料量 85 億個 Web Reputation 查詢 30 億個 Email Reputation 查詢 70 億個 File Reputation 查詢 處理 6 TB 從全世界收集到的 raw logs 來自 1.5 億台終端裝置的連線
66
Trend Micro: Web Reputation Services User Traffic | Honeypot Akamai Rating Server for Known Threats Unknown & Prefilter Page Download Threat Analysis 8 billions/day 4.8 billions/day 860 millions/day 40% filtered 82% filtered 25,000 malicious URL /day 99.98% filtered Trend Micro Products / Technology CDN Cache High Throughput Web Service Hadoop Cluster Web Crawling Machine Learning Data Mining TechnologyProcessOperation Block malicious URL within 15 minutes once it goes online! 15 Minutes
67
Big Data Cases
69
Line Data on HBase Line data –MODEL: -> –INDEX: -> User: ->, Consistency in HBase Contact model: use column qualifier to store Support range query (e.g. message box)
70
Pig at Linkedin
71
Linkedin - Pig Example views = LOAD '/data/awesome' USING VoldemortStorage(); views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1’)
72
Facebook Messages
73
Facebook Open Source Stack Memcached --> App Server Cache ▪ZooKeeper --> Small Data Coordination Service ▪HBase --> Database Storage Engine ▪HDFS --> Distributed FileSystem ▪Hadoop --> Asynchronous Map-Reduce Jobs
74
Questions?
75
Thank you!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.