The Little Flying Elephant in the Cloud, Series Report Part 2 (Cloud Group)

Hadoop in SIGMOD 2011

Outline
- Introduction
- Nova: Continuous Pig/Hadoop Workflows
- Apache Hadoop Goes Realtime at Facebook
- Emerging Trends in the Enterprise Data Analytics
- A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Sessions in SIGMOD 2011
- Data Management for Feeds and Streams (2)
- Dynamic Optimization and Unstructured Content (4)
- Business Analytics (2)
- Support for Business Analytics and Warehousing (4)
- Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows
By Yahoo!

Nova Overview
Scenarios:
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data
Key features:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features

Workflow Model
A workflow has two kinds of vertices: tasks (processing steps) and channels (data containers); edges connect tasks to channels and channels to tasks (sketched below).
Four common patterns of processing:
- Non-incremental (template detection)
- Stateless incremental (shingling)
- Stateless incremental with lookup table (template tagging)
- Stateful incremental (de-duping)

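To make the vertex/edge structure concrete, here is a minimal, hypothetical sketch of the bipartite workflow graph in Java. Nova's actual internal representation is not shown in the talk; all class and field names below are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Nova's two-vertex workflow graph:
// edges run task -> channel (production) and channel -> task (consumption).
class Channel {
    final String name;
    Channel(String name) { this.name = name; }
}

class Task {
    final String name;
    final List<Channel> inputs = new ArrayList<>();   // channels this task consumes
    final List<Channel> outputs = new ArrayList<>();  // channels this task produces
    Task(String name) { this.name = name; }
}

public class Workflow {
    public static void main(String[] args) {
        // Example wiring based on the talk: crawled pages flow through template tagging.
        Channel pages = new Channel("crawled-pages");
        Channel tagged = new Channel("tagged-pages");
        Task templateTagging = new Task("template-tagging");
        templateTagging.inputs.add(pages);    // channel -> task edge
        templateTagging.outputs.add(tagged);  // task -> channel edge
        System.out.println(templateTagging.name + ": " + pages.name + " -> " + tagged.name);
    }
}
```
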
Workflow Model (cont.): Data and Update Model
A channel's data is divided into blocks.
- Base block: contains a complete snapshot of the data on a channel as of some point in time. Base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn).
- Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block (e.g., Δ0→1 transforms B0 into B1).

Workflow Model (cont.): Task/Data Interface
- Consumption mode: all (read a complete snapshot) or new (read only data added since the last execution)
- Production mode: B (emit a new base block) or Δ (emit a delta block)

Workflow Model (cont.): Workflow Programming and Scheduling
Triggers: data-based, time-based, and cascade.
Data Compaction and Garbage Collection (see the sketch below):
- If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes and adds B3 to the channel.
- After compaction adds B3, if the current consumer cursor is at sequence number 2, then B0 and Δ0→1 can be garbage-collected.

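A minimal sketch of the compaction and garbage-collection bookkeeping above, assuming blocks are identified only by their sequence numbers; the actual merge logic is elided and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical channel bookkeeping: one base block plus a chain of deltas.
public class ChannelCompaction {
    record Block(String kind, int from, int to) {}  // kind is "B" (base) or "D" (delta)

    public static void main(String[] args) {
        List<Block> channel = new ArrayList<>(List.of(
            new Block("B", 0, 0),   // B0
            new Block("D", 0, 1),   // delta 0 -> 1
            new Block("D", 1, 2),   // delta 1 -> 2
            new Block("D", 2, 3))); // delta 2 -> 3

        // Compaction: apply the delta chain to the base and add the new base B3
        // (the merge computation itself is elided here).
        int latest = channel.get(channel.size() - 1).to();
        channel.add(new Block("B", latest, latest));

        // Garbage collection: with the consumer cursor at sequence 2, any block
        // whose output sequence number is below the cursor is no longer needed.
        int cursor = 2;
        channel.removeIf(b -> b.to() < cursor);  // drops B0 and delta 0 -> 1

        channel.forEach(b -> System.out.println(
            b.kind() + b.from() + (b.kind().equals("D") ? "->" + b.to() : "")));
    }
}
```
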
Nova System Architecture

Apache Hadoop Goes Realtime at Facebook
By Facebook

Workload Types
- Facebook Messaging: high write throughput, large tables, data migration
- Facebook Insights: realtime analytics, high-throughput increments
- Facebook Metrics System (ODS): automatic sharding, fast reads of recent data and table scans

Why Hadoop and HBase? Requirements:
- Elasticity
- High write throughput
- Efficient, low-latency strong consistency semantics within a data center
- Efficient random reads from disk
- High availability and disaster recovery
- Fault isolation
- Atomic read-modify-write primitives (see the sketch below)
- Range scans
- Tolerance of network partitions within a single data center
- Zero downtime in case of individual data center failure
- Active-active serving capability across different data centers

Realtime HDFS
High Availability: AvatarNode (an active/standby NameNode pair for fast failover)

Realtime HDFS (cont.)
- Hadoop RPC compatibility
- Block availability: a pluggable block placement policy

Realtime HDFS (cont.)
Performance improvements for a realtime workload:
- RPC timeout
- Reads from local replicas
New features:
- HDFS sync (see the sketch below)
- Concurrent readers

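HDFS sync is what lets a writer such as HBase's write-ahead log make new entries visible to readers before the file is closed. A minimal sketch, assuming a reachable HDFS cluster; the method is named hflush() in Hadoop 2.x APIs (the 0.20-append branch the paper describes called it sync()), and the path and record format are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/hbase/wal/log.00001"));
        out.writeBytes("put row=page#42 col=stats:impressions value=1\n");
        // Push the buffered bytes down the replica pipeline so a concurrent
        // reader can see them now, without waiting for close().
        out.hflush();
        out.close();
        fs.close();
    }
}
```
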
Production HBase
ACID compliance (RWCC: Read-Write Consistency Control):
- Atomicity (WALEdit)
- Consistency
Availability improvements:
- HBase Master rewrite: region assignment moved from memory to ZooKeeper
- Online upgrades
- Distributed log splitting
Performance improvements:
- Compactions (minor and major)
- Read optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
By IBM

Motivation
1. Increasing volumes of data
2. Hadoop-based solutions in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
By Teradata

Motivation
- ETL (Extraction, Transformation, Loading) is a critical part of data warehousing.
- While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (a bottleneck).

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
- More disk space can be easily added
- Use as an intermediate storage layer
- MapReduce for transformation
- Load data in parallel

Block Assignment Problem
- An HDFS file F is stored on a cluster of P nodes, each uniquely identified by an integer i with 1 ≤ i ≤ P.
- The problem is defined by assignment(X, Y, n, m, k, r):
  - X = {1, ..., n} is the set of n blocks of F
  - Y ⊆ {1, ..., P} is the set of m nodes running the PDBMS (the PDBMS nodes)
  - k is the number of copies (replicas) kept of each block
  - r is the mapping recording replica locations: r(i) returns the set of nodes holding a copy of block i

Block Assignment Problem (cont.)
- An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j.
- An even assignment is an assignment g such that for all i, j ∈ Y:
  | |{x | 1 ≤ x ≤ n and g(x) = i}| − |{y | 1 ≤ y ≤ n and g(y) = j}| | ≤ 1,
  i.e., the block counts of any two nodes differ by at most one.
- The cost of an assignment g is cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, the number of blocks assigned to nodes that hold no local copy (remote nodes); a toy example follows.

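To ground these definitions, a small self-contained sketch with an invented replica map r and assignment g; it computes cost(g) and checks the evenness condition:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BlockAssignment {
    public static void main(String[] args) {
        // Toy instance: n = 4 blocks, PDBMS nodes Y = {1, 2}, k = 2 replicas.
        Map<Integer, Set<Integer>> r = Map.of(   // r(i): nodes holding a copy of block i
            1, Set.of(1, 3),
            2, Set.of(2, 3),
            3, Set.of(1, 2),
            4, Set.of(3, 4));
        Map<Integer, Integer> g = Map.of(1, 1, 2, 2, 3, 1, 4, 2);  // g(i) = assigned node
        List<Integer> y = List.of(1, 2);

        // cost(g): number of blocks assigned to a node with no local copy.
        long cost = g.entrySet().stream()
            .filter(e -> !r.get(e.getKey()).contains(e.getValue()))
            .count();  // only block 4 is remote here, so cost = 1

        // Even assignment: per-node block counts differ by at most one.
        boolean even = true;
        for (int i : y)
            for (int j : y) {
                long ci = g.values().stream().filter(v -> v == i).count();
                long cj = g.values().stream().filter(v -> v == j).count();
                if (Math.abs(ci - cj) > 1) even = false;
            }
        System.out.println("cost = " + cost + ", even = " + even);
    }
}
```
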
Thank You!