Google Cloud computing techniques (Lecture 03). 18th Jan 2016. Dr. S. Sridhar, Director, RVCT, RVCE, Bangalore-560059
The Google File System
The Google File System (GFS)
A scalable distributed file system for large, distributed, data-intensive applications. Multiple GFS clusters are currently deployed; the largest ones have:
- 1000+ storage nodes
- 300+ terabytes of disk storage
- heavy access by hundreds of clients on distinct machines
Introduction
GFS shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc. Its design has been driven by four key observations of Google's application workloads and technological environment.
Intro: Observations 1
1. Component failures are the norm: constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system.
2. Huge files (by traditional standards): multi-GB files are common, so I/O operations and block sizes must be revisited.
Intro: Observations 2
3. Most files are mutated by appending new data: this is the focus of performance optimization and atomicity guarantees.
4. Co-designing the applications and the file-system API benefits the overall system by increasing flexibility.
The Design
A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.
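To make the master/chunkserver split concrete, here is a toy sketch of the read path: the client asks the master only for metadata (which chunk, which replicas), then fetches data directly from a chunkserver, keeping the master off the data path. All class and method names here are illustrative, not the real GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

class Chunkserver:
    def __init__(self):
        self.chunks = {}  # (filename, chunk_index) -> bytes

    def read(self, filename, chunk_index, offset, length):
        return self.chunks[(filename, chunk_index)][offset:offset + length]

class Master:
    """Holds metadata only: which chunkservers replicate which chunk."""
    def __init__(self):
        self.chunk_locations = {}  # (filename, chunk_index) -> [server ids]

    def lookup(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_locations[(filename, chunk_index)]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # server id -> Chunkserver

    def read(self, filename, offset, length):
        # 1. Metadata from the master: chunk index and replica locations.
        chunk_index, replicas = self.master.lookup(filename, offset)
        # 2. Data directly from one replica; the master never sees file data.
        server = self.chunkservers[replicas[0]]
        return server.read(filename, chunk_index, offset % CHUNK_SIZE, length)

# Toy cluster: one master, two chunkservers replicating a single chunk.
master = Master()
cs = {0: Chunkserver(), 1: Chunkserver()}
for s in cs.values():
    s.chunks[("/logs/web.log", 0)] = b"GET /index.html 200"
master.chunk_locations[("/logs/web.log", 0)] = [0, 1]
client = Client(master, cs)
print(client.read("/logs/web.log", 4, 11))  # b'/index.html'
```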
Operation Log
A record of all critical metadata changes, stored on the master and replicated on other machines. It defines the order of concurrent operations. Changes are not visible to clients until they propagate to all chunk replicas. The log is also used to recover the file system state.
System Interactions: Leases and Mutation Order
Introduction to MapReduce
MapReduce: Insight
"Consider the problem of counting the number of occurrences of each word in a large collection of documents." How would you do it in parallel?
Map operation
Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g. (doc-id, doc-content). Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
Reduce operation
On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer. It can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
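The map and reduce phases above can be sketched for the word-count problem as follows. This is a minimal single-process model of the programming model, not a distributed implementation; the function names are illustrative.

```python
from collections import defaultdict

def map_fn(doc_id, doc_content):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in doc_content.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)

def map_reduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for doc_id, content in documents.items():
        for key, value in map_fn(doc_id, content):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts["the"])  # 3
```

In a real MapReduce run the shuffle is done by the framework across machines; the user supplies only map_fn and reduce_fn.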
MapReduce: Execution overview
MapReduce: Example
MapReduce in Parallel: Example
MapReduce: Some More Apps
- Distributed grep
- Count of URL access frequency
- Clustering (k-means)
- Graph algorithms
- Indexing systems
- MapReduce programs in the Google source tree
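Distributed grep is a good illustration of how simple a MapReduce application can be: map emits a line if it matches a pattern, and reduce is just the identity. A minimal single-process sketch, with illustrative names and made-up sample data:

```python
import re
from collections import defaultdict

def grep_map(filename, lines, pattern):
    """Map: emit (filename, line) for every line matching the pattern."""
    for line in lines:
        if re.search(pattern, line):
            yield filename, line

def grep_reduce(filename, matches):
    """Reduce: identity; just pass through the matches for each file."""
    return filename, matches

# Hypothetical input split across two "files".
files = {
    "a.log": ["GET /home 200", "POST /login 500", "GET /about 200"],
    "b.log": ["GET /home 500"],
}
groups = defaultdict(list)  # shuffle: group matches by filename
for name, lines in files.items():
    for key, value in grep_map(name, lines, r"\b500\b"):
        groups[key].append(value)
results = dict(grep_reduce(k, v) for k, v in groups.items())
print(results["a.log"])  # ['POST /login 500']
```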
MapReduce: Extensions and similar apps
Pig (Yahoo), Hadoop (Apache), DryadLINQ (Microsoft)
Large Scale Systems Architecture using MapReduce
User App
MapReduce
Distributed File System (GFS)
BigTable: A Distributed Storage System for Structured Data
Introduction
BigTable is a distributed storage system for managing structured data, designed to scale to a very large size: petabytes of data across thousands of servers. It is used for many Google projects, including web indexing, Personalized Search, Google Earth, Google Analytics, and Google Finance. It is a flexible, high-performance solution for all of Google's products.
Motivation
Lots of (semi-)structured data at Google:
- URLs: contents, crawl metadata, links, anchors, PageRank, ...
- Per-user data: user preference settings, recent queries/search results, ...
- Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
The scale is large:
- billions of URLs, many versions per page (~20 KB/version)
- hundreds of millions of users, thousands of queries/sec
- 100+ TB of satellite image data
Why not just use a commercial DB?
The scale is too large for most commercial databases, and even if it weren't, the cost would be very high. Building internally means the system can be applied across many projects for low incremental cost. Low-level storage optimizations help performance significantly, and they are much harder to do when running on top of a database layer.
BigTable
- Distributed multi-level map
- Fault-tolerant, persistent
- Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans
- Self-managing: servers can be added/removed dynamically and adjust to load imbalance
Basic Data Model
A BigTable is a sparse, distributed, persistent multi-dimensional sorted map:
(row, column, timestamp) -> cell contents
This is a good match for most Google applications.
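The (row, column, timestamp) -> cell contents mapping can be modeled in a few lines. This is purely a toy sketch of the data model, assuming a made-up ToyBigTable class; real BigTable stores this map in tablets backed by SSTables, not in memory.

```python
import bisect

class ToyBigTable:
    """Toy model of BigTable's sorted, sparse, versioned map."""
    def __init__(self):
        self.cells = {}  # (row, column, timestamp) -> contents
        self.keys = []   # sorted keys, mimicking the lexicographic row order

    def put(self, row, column, timestamp, contents):
        key = (row, column, timestamp)
        if key not in self.cells:
            bisect.insort(self.keys, key)  # keep keys sorted on insert
        self.cells[key] = contents

    def get(self, row, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = [k for k in self.keys if k[:2] == (row, column)]
        return self.cells[versions[-1]] if versions else None

t = ToyBigTable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))  # <html>v2</html>
```

Note how the map is sparse: only cells that were actually written occupy space, and each cell can hold multiple timestamped versions.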
WebTable Example
We want to keep a copy of a large collection of web pages and related information. Use URLs as row keys and various aspects of the web page as column names. Store the contents of web pages in the contents: column under the timestamps when they were fetched.
Rows
- The row name is an arbitrary string
- Access to data in a row is atomic
- Row creation is implicit upon storing data
- Rows are ordered lexicographically
- Rows close together lexicographically usually reside on one or a small number of machines
Rows (cont.)
Reads of short row ranges are efficient and typically require communication with only a small number of machines. Applications can exploit this property by selecting row keys that give good locality for data access.
Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
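The second key scheme in the example works by reversing the domain components, so all pages from the same institution sort next to each other and land on the same few machines. A minimal sketch of that transformation (the helper name is illustrative):

```python
def row_key(url_host):
    """Reverse domain components: math.gatech.edu -> edu.gatech.math."""
    return ".".join(reversed(url_host.split(".")))

hosts = ["math.gatech.edu", "phys.uga.edu", "phys.gatech.edu", "math.uga.edu"]
keys = sorted(row_key(h) for h in hosts)
print(keys)
# ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']
```

With the reversed keys, a scan for all gatech.edu pages is a single contiguous row range, whereas the unreversed keys scatter them across the table.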