Published by Ariel Reynolds. Modified over 8 years ago.
CLOUD COMPUTING ARCHITECTURES & APPLICATIONS
LECTURERS: LAZAR KIRCHEV, PhD; ILIYAN NENOV; KRUM BAKALSKY
11 April, 2011
LECTURE #7: DATA STRUCTURES AND ALGORITHMS USED IN CLOUD COMPUTING. DATABASES IN THE CLOUD.
2011 Sofia University “Sv. Kliment Ohridski” > Faculty of Mathematics and Informatics > Cloud Computing Architecture and Applications

OUTLINE
- Data structures and algorithms
  - MapReduce framework
  - Higher level extensions (DSL)
  - Distributed file systems
- Common design principles
  - Theoretical limitations: CAP theorem
  - BASE vs. ACID
- Database systems
  - What's with the RDBMS?
  - NoSQL solutions
  - Adapting RDBMS systems to NoSQL usage
Data Structures and Algorithms
Google MapReduce
- Data-intensive text processing
- Generic framework, inspired by functional programming
- Word count, inverted index, PageRank, graph algorithms
- Simple, yet powerful idea
- Open source implementations: Hadoop
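The word count case listed above can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the map, shuffle, and reduce phases, not the Hadoop or Google API; the names `run_mapreduce`, `map_word_count`, and `reduce_word_count` are made up for this sketch.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map phase: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver (not a real framework API)."""
    groups = defaultdict(list)
    # Map, then shuffle: group every emitted value by its key.
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce: one call per distinct key.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


docs = {"d1": "the cloud stores data", "d2": "the cloud scales"}
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
```

In a real framework the map and reduce calls run on many machines and the shuffle moves data over the network; only the two user-supplied functions keep this shape.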
Higher level extensions
Google: Sawzall
- High-level language for performing parallel data analysis and data mining
- Runs on Google MapReduce and GFS
- Workflow/scheduling management for Sawzall jobs: Workqueue
- Job chaining
- Sawzall programs are compiled into an intermediate code, which is interpreted at run time
- Benefits
  - Programs are clearer, more compact, and more expressive
  - Programs are smaller and easier to develop
Higher level extensions
Hadoop: Pig/Pig Latin
- High-level data flow language, execution framework
- Developed at Yahoo!
- Pig programs are translated to MapReduce jobs and run on Hadoop
- Pros
  - Pig programs are 1/20 the size of the equivalent MapReduce programs
  - Development time and effort are significantly reduced
- Cons
  - Execution time is 1.5 times longer
Pig example (figure not reproduced in this transcript)
Translation to Java MapReduce jobs (figure not reproduced in this transcript)
Distributed File Systems
- Google File System
  - High-performance, scalable, distributed, fault-tolerant, running on commodity hardware
  - Master-slave pattern
  - Open source implementation in C++: CloudStore (formerly KosmosFS)
- Hadoop Distributed File System
Common Design Principles
CAP theorem (Brewer, Lynch)
- Consistency
  - All nodes see the same data at the same time
  - Strong/immediate consistency (ACID)
  - Eventual consistency (BASE)
- Availability
  - Node failures do not prevent survivors from continuing to operate (all clients can find some available replica)
- Partition tolerance
  - The system continues to operate despite arbitrary message loss (i.e., even when split into disconnected subsets, for example by a network disruption)
- The theorem: a distributed system can satisfy any two of these guarantees at the same time, but not all three.
Cassandra's consistency levels
- ONE: a write succeeds once at least one node's commit log and memtable have been updated with the new data and that node's response has reached the client.
- QUORUM: data has to be written to N/2 + 1 nodes (N being the replication factor, with integer division) before responding to the client.
- ALL: all replica nodes have to read (write) the data.
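The quorum arithmetic can be made concrete with a short sketch. The function names here are hypothetical and not part of any Cassandra client API. The overlap check `R + W > N` is not stated on the slide; it is the usual rule of thumb for when a read is guaranteed to touch at least one replica that holds the latest write.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(N / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_overlaps_write(read_level, write_level, replication_factor):
    """True when R + W > N, so read and write replica sets must intersect."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads combined with QUORUM writes overlap (2 + 2 > 3), while ONE combined with ONE does not (1 + 1 ≤ 3), which is where eventual consistency shows up.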
BASE vs. ACID
- Dan Pritchett, eBay: “In partitioned databases, trading some consistency for availability can lead to dramatic improvements in scalability.”
- BASE architecture: Basically Available, Soft state, Eventually consistent
Database Systems
What's with the RDBMS?
- Highly consistent transactional behaviour (ACID)
  - Atomicity
  - Consistency
  - Isolation
  - Durability
- Relational model is simple, intuitive, and easy to understand (solid theory behind it)
- Highly normalized data
- Powerful standardized query language (SQL)
- Mature (20+ years), with many optimizations
Scalability, performance?
- Use external caching (Memcache)
- Use indexes, table spaces, or vendor-specific tuning (loses portability)
- Introduce partitioning/sharding
  - Usually a manual process on a per-application basis
  - Increases landscape complexity
  - Requires distributed transactions
- Some data is de-normalized for performance
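To illustrate what application-level sharding involves, here is a minimal sketch that routes each key to one of several in-memory shards. Hash-modulo routing is an assumption made for this example (the slide does not say which partitioning scheme is meant), and `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key (hash-modulo routing)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each dict stands in for one separate database server."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})
```

The sketch also hints at the costs listed above: the routing logic lives in the application, changing the number of shards remaps most keys, and any operation touching keys on two shards needs coordination across servers.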
RDBMS problems
- Resource limits (indexes, data, underlying storage engines, lock data, logs, etc.)
- Weak elasticity/scalability
- Partitioning/sharding (data replication): consistency can break availability
- Schema changes are painful
- Normalized data is slow to work with (how fast is it to join million-row tables?)
- High cost to operate / expensive hardware
The NoSQL boom (figure not reproduced in this transcript)
NoSQL solutions
- What is NoSQL?
  - Not only SQL
  - Non-relational
- What do they bring?
  - Eventual consistency (BASE)
  - Weakly structured schema or no schema (documents)
  - JavaScript, JSON, REST
  - Distributed, horizontally scalable
  - Can cope with huge data volumes
- Directions of innovation
  - Column-based storage
  - In-memory computing
  - Different (non-relational) data structures
  - Multitenancy support
NoSQL categories (figure not reproduced in this transcript)
Entity-Attribute-Value model
- Data model for describing entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that actually applies to a given entity is relatively modest
- EAV database: a database where a large portion of the data is modeled as EAV
- Use cases
  - Attribute data types vary
  - Large number of data categories, but the number of instances in each category is small
  - Modeling a clinical record DB or an e-shop DB
- EAV in cloud computing
  - Amazon's SimpleDB (the only data type is string)
  - Microsoft's Windows Azure Table Storage (allowed types include byte[], bool, long, string, …)
  - Google App Engine: greatest variety, allows custom data types
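A small sketch may help show what "modeled as EAV" means in practice: every fact is one (entity, attribute, value) triple, so each entity stores only the attributes that apply to it. The `EavStore` class and the clinical sample data are invented for illustration and do not reflect the API of any service named above.

```python
class EavStore:
    """Stores facts as (entity, attribute, value) triples."""

    def __init__(self):
        self._rows = []

    def put(self, entity, attribute, value):
        # Replace any existing value for this entity/attribute pair.
        self._rows = [
            row for row in self._rows if (row[0], row[1]) != (entity, attribute)
        ]
        self._rows.append((entity, attribute, value))

    def entity(self, entity):
        """Reassemble one entity from its triples."""
        return {a: v for e, a, v in self._rows if e == entity}

    def find(self, attribute, value):
        """Entities that have the given attribute set to the given value."""
        return sorted(
            e for e, a, v in self._rows if a == attribute and v == value
        )


records = EavStore()
records.put("patient-1", "blood_pressure", "120/80")
records.put("patient-1", "allergy", "penicillin")
records.put("patient-2", "allergy", "penicillin")
records.put("patient-2", "heart_rate", "64")
```

Values are kept as strings here, mirroring the string-only case; adding a new attribute needs no schema change, at the price of reassembling entities at read time.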
Key value / Tuple store
- One key, one value, no duplicates, and crazy fast!
- It is a distributed hash table
- The value is a binary object (BLOB); the DB does not understand it
- Amazon Dynamo, MemcacheDB, Redis, BerkeleyDB, Azure Table Storage
Document based storage
- Key-value store, but the value is usually a structured/semi-structured document, which is understood by the DB
- Storage format is usually JSON
- MapReduce-based materialized views (a kind of querying approach)
- CouchDB, MongoDB
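The idea of a MapReduce-based view can be sketched as follows. CouchDB views are normally written in JavaScript; Python is used here only as a stand-in, and `build_view`, `orders_by_customer`, and the sample documents are assumptions made for this example. The view computes what `SELECT customer, SUM(total) FROM orders GROUP BY customer` would compute in SQL.

```python
from collections import defaultdict


def build_view(documents, map_fn, reduce_fn):
    """Hypothetical view builder: run map over every document, then reduce per key."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(emitted.items())}


def orders_by_customer(doc):
    """Map function: emit (customer, total) for order documents only."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


documents = [
    {"_id": "1", "type": "order", "customer": "alice", "total": 30},
    {"_id": "2", "type": "order", "customer": "bob", "total": 12},
    {"_id": "3", "type": "order", "customer": "alice", "total": 5},
    {"_id": "4", "type": "customer", "name": "alice"},
]
totals = build_view(documents, orders_by_customer, sum)
```

Because documents have no fixed schema, the map function itself decides which documents participate, which is the role a `FROM`/`WHERE` clause plays in SQL.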
Document based storage - CouchDB
- B-tree storage engine
- No JOINs, no PK/FK
- Implemented in Erlang (first version in C++)
CouchDB views: MapReduce vs. SQL (figure not reproduced in this transcript)
Column based / Wide column store / Column families
- Each key is associated with many attributes (columns)
- NoSQL column stores are actually hybrid row/column stores
- Google Bigtable, Hypertable, HBase, Cassandra, Amazon SimpleDB
Google Bigtable
- Distributed, high-performance, sparse multi-dimensional sorted map
- Builds on other Google components: GFS, SSTable, Chubby, WorkQueue
- MapReduce, Sawzall integration
- Designed to handle petabytes of data and to run on commodity hardware
- Data model: (row: string, column: string, time: int64) → string
Google Bigtable
- Columns are grouped in column families (family:qualifier)
- Column families (CFs) are the basic unit of access control
- Versioning (timestamp handling, garbage collection)
  - Keep the last N versions
  - Keep versions from the last 5 days
- CFs offer a way for clients to exploit data locality
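The two garbage-collection policies for cell versions can be sketched as simple filters over a list of (timestamp, value) pairs. The function names and the use of seconds as the timestamp unit are assumptions for this illustration, not Bigtable's actual interface.

```python
DAY = 24 * 60 * 60  # seconds; the timestamp unit is an assumption of this sketch


def keep_last_n(versions, n):
    """Keep the n newest versions of a cell, newest first."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    return newest_first[:n]


def keep_newer_than(versions, now, max_age):
    """Keep versions whose timestamp is within max_age of now, newest first."""
    cutoff = now - max_age
    recent = [tv for tv in versions if tv[0] >= cutoff]
    return sorted(recent, key=lambda tv: tv[0], reverse=True)


now = 1_000_000
cell_versions = [
    (now - 6 * DAY, "v1"),
    (now - 3 * DAY, "v2"),
    (now - 1 * DAY, "v3"),
]
last_two = keep_last_n(cell_versions, 2)
last_five_days = keep_newer_than(cell_versions, now, 5 * DAY)
```

In this sample both policies happen to drop only `v1`; they differ as soon as many versions are written within a short time or none are written for a long time.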
Google Bigtable building blocks - SSTable
- File format for storing sorted key-value string pairs
- Immutable, persistent
- Stored in GFS, optionally mapped into memory (lazily)
- Lookups can be filtered with a Bloom filter
- Consists of chunks of data plus an index
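The combination of sorted chunks, an index, and a Bloom filter can be sketched in memory as follows. This is a simplified illustration under stated assumptions: the real on-disk layout is not described on the slide, the index here simply holds the first key of each chunk, and `SSTable` and `BloomFilter` are hypothetical classes written for this example.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable sorted key-value pairs, split into chunks, plus an index."""

    def __init__(self, items, chunk_size=2):
        pairs = sorted(items.items())
        self.chunks = [
            pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)
        ]
        # Index: first key of every chunk, in sorted order.
        self.index = [chunk[0][0] for chunk in self.chunks]
        self.bloom = BloomFilter()
        for key, _ in pairs:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent, no chunk is read
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None  # key sorts before the first chunk
        for k, v in self.chunks[i]:
            if k == key:
                return v
        return None


table = SSTable({"b": "2", "a": "1", "d": "4", "c": "3", "e": "5"})
```

Because the table is immutable, the index and the Bloom filter are built once and never need updating; a lookup reads at most one chunk.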
Google Bigtable building blocks - Tablet
- Contains some range of rows from the table
- Rows are sorted lexicographically
- Built of multiple SSTables
- Compressed after reaching a certain size
- CF data is compressed together (locality)
- Tablet size is chosen to be ~200 MB for GFS reasons
Google Bigtable building blocks - Table
- Multiple tablets make up the table
- SSTables can be shared
Finding a tablet (figure not reproduced in this transcript)
Google Bigtable implementation: three building blocks
- Client library
  - Caches tablet locations for performance
- Master server (exactly one; kills itself in some scenarios)
  - Assigns tablets to tablet servers
  - Detects addition/expiration of tablet servers (Chubby)
  - Tablet server load balancing
  - Garbage collection of GFS files
  - Schema changes such as table and column family creation
- Tablet servers (~thousands, dynamically added/removed)
  - Handle read/write requests to the tablets they have loaded
  - Split/compress tablets that have grown too large
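The client-side location cache can be illustrated with a small sketch. It assumes each tablet covers a half-open row range `[start, end)` and that some metadata service can map a row to its tablet; the actual lookup scheme is not reproduced here, and `TabletMap` and `Client` are hypothetical names.

```python
import bisect


class TabletMap:
    """Sorted collection of (start_row, end_row, server); end_row None means open-ended."""

    def __init__(self):
        self._starts = []
        self._tablets = []  # kept parallel to _starts

    def add(self, start, end, server):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._tablets.insert(i, (start, end, server))

    def find(self, row):
        # The candidate is the last tablet whose start row is <= row.
        i = bisect.bisect_right(self._starts, row) - 1
        if i >= 0:
            start, end, server = self._tablets[i]
            if end is None or row < end:
                return self._tablets[i]
        return None


class Client:
    """Asks the metadata service only when the local cache misses."""

    def __init__(self, metadata):
        self._metadata = metadata
        self._cache = TabletMap()
        self.metadata_lookups = 0

    def server_for(self, row):
        tablet = self._cache.find(row)
        if tablet is None:
            self.metadata_lookups += 1
            tablet = self._metadata.find(row)
            if tablet is None:
                raise KeyError(row)
            self._cache.add(*tablet)
        return tablet[2]


metadata = TabletMap()
metadata.add("", "g", "ts-1")
metadata.add("g", "p", "ts-2")
metadata.add("p", None, "ts-3")
client = Client(metadata)
```

Since rows are sorted lexicographically, a binary search over tablet start rows is enough to route a request; a production client would also have to invalidate cache entries when tablets are split or reassigned, which this sketch omits.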
Adapting RDBMS systems to NoSQL usage
- MySQL NDB Cluster, NDB API
- HandlerSocket plugin: http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html
NoSQL advantages/disadvantages
- Pros
  - Open source
  - Some serious and innovative science behind them
  - Massive data store, horizontally scalable
  - Great fit for many “Web 2.0” services and modern applications
- Cons
  - They are no panacea!
  - No standards; different DBs are suitable for different tasks (DS DBs)
  - Limited query capabilities
  - We do not think non-relationally … yet (a mindset change is needed)
  - Some are still experimental (alpha)
  - Most production DB systems are still relational
- A hybrid solution is probably best
END OF LECTURE #7
The information in this document is compiled using various public sources, freely available on the internet. These sources include:
- http://www.scribd.com/doc/17929394/Cloud-Computing-Use-Cases-Whitepaper
- http://www.enisa.europa.eu/act/rm/files/deliverables/cloud-computing-risk-assessment
- http://code.google.com/edu/parallel/index.html
- Google: Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
- Google Course: MapReduce in a Week, http://code.google.com/edu/submissions/mapreduce/listing.html
- Intensive MapReduce course at MIT, http://mr.iap.2008.googlepages.com
- Hadoop Virtual Image Documentation, http://code.google.com/edu/parallel/tools/hadoopvm/index.html
- http://www.umiacs.umd.edu/~jimmylin/cloud-computing
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf
- http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing
- Bingsheng He, Wenbin Fang, Qiong Luo, Mars: A MapReduce Framework on Graphics Processors, http://www.cse.ust.hk/catalac/users/saven/GPGPU/MapReduce/PACT08/171.pdf
- Hung-chih Yang, Ali Dasdan, Map-reduce-merge: simplified relational data processing on large clusters, http://portal.acm.org/citation.cfm?doid=1247480.1247602
- Foto N. Afrati, Jeffrey D. Ullman, A New Computation Model for Rack-Based Computing, http://infolab.stanford.edu/~ullman/pub/mapred.pdf
- Ralf Lammel, Google’s MapReduce Programming Model Revisited, http://www.cs.vu.nl/~ralf/MapReduce/paper.pdf
- http://www.baselinemag.com/c/a/Infrastructure/How-Google-Works-1
- Joe Hellerstein, Parallel Programming in the Age of Big Data, http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, https://sites.google.com/a/colgate.edu/cloudintro/Home

© 2011 COPYRIGHTS DISCLAIMER
The information in this document is proprietary to Sofia University “Sv. Kliment Ohridski” (called THE UNIVERSITY below), http://uni-sofia.bg. THE UNIVERSITY assumes no responsibility for errors or omissions in this document. THE UNIVERSITY does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
This document is used only for educational purposes related to the masters programs of THE UNIVERSITY, Faculty of Mathematics and Informatics, http://fmi.uni-sofia.bg. This document is compiled using various public sources freely available on the internet or offered by SAP AG. This document is not used directly or indirectly for any type of commercial use.
THE UNIVERSITY shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence. The statutory liability for personal injury and defective products is not affected. THE UNIVERSITY has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third-party Web pages nor provide any warranty whatsoever relating to third-party Web pages.