CLOUD COMPUTING ARCHITECTURES & APPLICATIONS LECTURERS LAZAR KIRCHEV, PhD ILIYAN NENOV KRUM BAKALSKY 11 April, 2011 LECTURE #7 DATA STRUCTURES AND ALGORITHMS.

1 CLOUD COMPUTING ARCHITECTURES & APPLICATIONS LECTURERS LAZAR KIRCHEV, PhD ILIYAN NENOV KRUM BAKALSKY 11 April, 2011 LECTURE #7 DATA STRUCTURES AND ALGORITHMS USED IN CLOUD COMPUTING. DATABASES IN THE CLOUD.

2 2011 Sofia University “Sv. Kliment Ohridski” > Faculty of Mathematics and Informatics > Cloud Computing Architecture and Applications
OUTLINE
Data structures and algorithms
• MapReduce framework
• Higher-level extensions (DSL)
• Distributed file systems
Common design principles
• Theoretical limitations: CAP theorem
• BASE vs. ACID
Database systems
• What’s with the RDBMS?
• NoSQL solutions
• Adapting RDBMS systems to NoSQL usage

3 Data Structures and Algorithms

4 Google MapReduce
• Data-intensive text processing
• Generic framework, inspired by functional programming
• Typical uses: word count, inverted index, PageRank, graph algorithms
• Simple, yet powerful idea
• Open source implementation: Hadoop
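
The word count use case named on this slide can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce steps, not Google's or Hadoop's API; the function names and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input documents.
counts = word_count({"d1": "the cloud stores data", "d2": "the cloud scales"})
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; only the two user-supplied functions stay the same.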

5 Higher level extensions
• Google: Sawzall
  • High-level language for parallel data analysis and data mining
  • Runs on Google MapReduce and GFS
  • Workflow/scheduling management for Sawzall jobs: Workqueue
  • Job chaining
  • Sawzall programs are compiled into intermediate code, which is interpreted at run time
• Benefits
  • Programs are clearer, more compact, and more expressive
  • Programs are smaller and easier to develop

6 Higher level extensions
• Hadoop: Pig/Pig Latin
  • High-level data flow language and execution framework
  • Developed at Yahoo!
  • Pig programs are translated to MapReduce jobs and run on Hadoop
• Pros
  • Pig programs are 1/20 the size of the equivalent MapReduce programs
  • Development time and effort are significantly reduced
• Cons
  • Execution time is 1.5 times longer

7 Pig example

8 Translation to Java MapReduce jobs

9 Distributed File Systems
• Google File System
  • High-performance, scalable, distributed, fault-tolerant, running on commodity hardware
  • Master-slave pattern
  • Open source implementation in C++: CloudStore (formerly KosmosFS)
• Hadoop Distributed File System

10 Common Design Principles

11 CAP theorem (Brewer, Lynch)
• Consistency
  • All nodes see the same data at the same time
  • Strong/immediate consistency (ACID)
  • Eventual consistency (BASE)
• Availability
  • Node failures do not prevent survivors from continuing to operate (all clients can find some available replica)
• Partition tolerance
  • The system continues to operate despite arbitrary message loss (i.e., even when split into disconnected subsets, for example by a network disruption)
• The theorem
  • A distributed system can satisfy any two of these guarantees at the same time, but not all three.

12 Cassandra’s consistency levels
• ONE
  • A write succeeds once at least one node has updated its commit log and memtable with the new data and that node's response has reached the client.
• QUORUM
  • Data has to be written to N/2 + 1 nodes (N being the replication factor) before responding to the client.
• ALL
  • All replicas have to take part in the read (or write).
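
The arithmetic behind these levels can be sketched as follows. The helper names are hypothetical and this is not Cassandra's client API. The overlap check `r + w > n` is the standard quorum argument, added here as an assumption; it does not appear on the slide.

```python
def quorum(replication_factor):
    """Number of replicas in a majority: N/2 + 1 with integer division."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(n, w, r):
    """True when every read set of r replicas overlaps every write set of w."""
    return r + w > n


n = 3  # hypothetical replication factor
w = replicas_required("QUORUM", n)
r = replicas_required("QUORUM", n)
```

With a replication factor of 3, QUORUM writes and QUORUM reads each touch 2 replicas, so any read overlaps the latest write; ONE/ONE touches 1 + 1 replicas and gives only eventual consistency.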

13 BASE vs. ACID
• Dan Pritchett, eBay: “In partitioned databases, trading some consistency for availability can lead to dramatic improvements in scalability”.
• BASE architecture: Basically Available, Soft state, Eventually consistent

14

15 Database Systems

16 What’s with the RDBMS?
• Highly consistent transactional behaviour (ACID)
  • Atomicity
  • Consistency
  • Isolation
  • Durability
• The relational model is simple, intuitive, and easy to understand (backed by solid theory)
• Highly normalized data
• Powerful standardized query language (SQL)
• Mature (20+ years), well optimized

17 Scalability, performance?
• Use external caching (memcached)
• Use indexes, table spaces, or vendor-specific tuning (at the cost of portability)
• Introduce partitioning/sharding
  • Usually a manual process on a per-application basis
  • Increases landscape complexity
  • Requires distributed transactions
• Some data is de-normalized for performance

18 RDBMS problems
• Resource limits (indexes, data, underlying storage engines, lock data, logs, etc.)
• Weak elasticity/scalability
• Partitioning/sharding (data replication): consistency can break availability
• Schema changes are painful
• Normalized data is slow to work on (how fast is it to join million-row tables?)
• High cost to operate, expensive hardware

19 The NoSQL boom

20 NoSQL solutions
• What is NoSQL?
  • Not only SQL
  • Non-relational
• What do they bring?
  • Eventual consistency (BASE)
  • Weakly structured schema or no schema (documents)
  • JavaScript, JSON, REST
  • Distributed, horizontally scalable
  • Can cope with huge data volumes
• Directions of innovation
  • Column-based storage
  • In-memory computing
  • Different (non-relational) data structures
  • Multitenancy support

21 NoSQL categories

22 Entity-Attribute-Value model
• A data model for entities where the number of attributes (properties, parameters) that could describe them is potentially vast, but the number that actually applies to a given entity is relatively modest
• EAV database: a database where a large portion of the data is modeled as EAV
• Use cases
  • Attribute data types vary
  • Large number of data categories, but few instances in each category
  • Modeling a clinical record DB or an e-shop DB
• EAV in cloud computing
  • Amazon’s SimpleDB (the only data type is string)
  • Microsoft’s Windows Azure Table Storage (allowed types include byte[], bool, long, string, …)
  • Google App Engine: greatest variety, allows custom data types
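
A minimal sketch of the EAV idea, using the e-shop use case from the slide. The class, its methods, and the sample products are all hypothetical; real services such as SimpleDB expose their own APIs.

```python
class EAVStore:
    """Stores facts as (entity, attribute) -> value instead of fixed columns."""

    def __init__(self):
        self.facts = {}

    def put(self, entity, attribute, value):
        """Set one attribute; entities need not share the same attributes."""
        self.facts[(entity, attribute)] = value

    def entity(self, entity):
        """Collect the attributes that actually apply to one entity."""
        return {a: v for (e, a), v in self.facts.items() if e == entity}

    def find(self, attribute, value):
        """Return the entities whose attribute has the given value."""
        return sorted(
            e for (e, a), v in self.facts.items() if a == attribute and v == value
        )


# Hypothetical e-shop catalogue: each product uses a different attribute set.
shop = EAVStore()
shop.put("book-1", "title", "Cloud Basics")
shop.put("book-1", "pages", 320)
shop.put("shirt-7", "size", "M")
shop.put("shirt-7", "colour", "blue")
shop.put("shirt-9", "size", "M")
```

Only the attributes that apply to an entity take up space, which is the point of the model when the set of possible attributes is large and sparse.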

23 Key-value / Tuple store
• One key, one value, no duplicates, and crazy fast!
• It is a distributed hash table
• The value is a binary object (BLOB); the DB does not understand it
• Amazon Dynamo, MemcacheDB, Redis, BerkeleyDB, Azure Table Storage
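
The "distributed hash table" idea can be illustrated with a toy store that hashes each key to pick the node holding it. The class and node names are hypothetical. Placement uses a simple modulo over the node list, which is an assumption for brevity; systems such as Dynamo use consistent hashing so that adding a node moves fewer keys.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store: one dict per node, keys placed by hash."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        """Pick a node deterministically from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return self.node_names[index]

    def put(self, key, value):
        """Store an opaque bytes value; the store never inspects it."""
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        """Return the stored bytes, or None if the key is absent."""
        return self.nodes[self.node_for(key)].get(key)


store = PartitionedKVStore(["node-a", "node-b", "node-c"])
store.put("user:42", b'{"name": "Ana"}')
store.put("user:42", b'{"name": "Ana B."}')  # same key: value is overwritten
```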

24 Document-based storage
• A key-value store, but the value is usually a structured or semi-structured document that the DB understands
• The storage format is usually JSON
• MapReduce-based materialized views (a kind of querying approach)
• CouchDB, MongoDB
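
A sketch of how a MapReduce-based view can stand in for a query. CouchDB views are written in JavaScript; the Python below only imitates the idea, and `build_view`, the sample documents, and their field names are assumptions made for the example.

```python
from collections import defaultdict


def build_view(documents, map_fn, reduce_fn):
    """Materialize a view: run map_fn on each document, then reduce per key."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(emitted.items())}


# Hypothetical JSON-like documents with no shared schema.
docs = [
    {"_id": "1", "type": "post", "author": "ana", "tags": ["cloud", "nosql"]},
    {"_id": "2", "type": "post", "author": "boris", "tags": ["cloud"]},
    {"_id": "3", "type": "comment", "author": "ana"},
]


def posts_per_tag(doc):
    """Map function: emit (tag, 1) for every tag of every post."""
    if doc.get("type") == "post":
        for tag in doc.get("tags", []):
            yield tag, 1


view = build_view(docs, posts_per_tag, sum)
```

The view plays roughly the role of a `SELECT tag, COUNT(*) ... GROUP BY tag` query, except that the result is computed ahead of time and stored, sorted by key.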

25 Document-based storage - CouchDB
• B-tree storage engine
• No JOINs, no PK/FK
• Implemented in Erlang (first version in C++)

26 CouchDB views – MapReduce vs. SQL

27 Column-based / Wide column store / Column families
• Each key is associated with many attributes (columns)
• NoSQL column stores are actually hybrid row/column stores
• Google Bigtable, Hypertable, HBase, Cassandra, Amazon SimpleDB

28 Google Bigtable
• Distributed, high-performance, sparse, multi-dimensional sorted map
• Builds on other Google infrastructure: GFS, SSTable, Chubby, WorkQueue
• MapReduce and Sawzall integration
• Designed to handle petabytes of data and run on commodity hardware
• Data model: (row:string, column:string, time:int64) → string

29 Google Bigtable
• Columns are grouped into column families (family:qualifier)
• Column families (CFs) are the basic unit of access control
• Versioning (timestamp handling, garbage collection)
  • Keep the last N versions
  • Keep versions from the last 5 days
• CFs offer a way for clients to exploit data locality
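
The data model described on these two slides (a sparse sorted map keyed by row, family:qualifier column, and timestamp, with old versions garbage-collected) can be sketched in memory. The class and method names are hypothetical and none of Bigtable's distribution or storage is modeled; only the "keep the last N" policy is shown.

```python
class SparseTable:
    """Toy map: (row, "family:qualifier", timestamp) -> string value."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row, column, value, timestamp):
        """Add a version, then drop everything beyond the newest N."""
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Return (row, column) keys in sorted order for start_row <= row < end_row."""
        return sorted(key for key in self.cells if start_row <= key[0] < end_row)


table = SparseTable(max_versions=2)
table.put("com.example.www", "contents:html", "v1", timestamp=1)
table.put("com.example.www", "contents:html", "v2", timestamp=2)
table.put("com.example.www", "contents:html", "v3", timestamp=3)
table.put("com.example.www", "anchor:home", "Example", timestamp=3)
table.put("org.sample", "contents:html", "s1", timestamp=1)
```

Cells that were never written occupy no space, which is what "sparse" means here, and the version written at timestamp 1 is gone because only the newest two are kept.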

30 Google Bigtable building blocks - SSTable
• File format for storing sorted key-value string pairs
• Immutable, persistent
• Stored in GFS, optionally mapped into memory (lazily)
• Can be Bloom filtered
• Consists of chunks of data plus an index
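
The "chunks of data plus an index" layout can be illustrated with an in-memory stand-in. This is only a sketch: the class name and the `block_size` parameter are assumptions, nothing is written to disk, and the Bloom filter is left out.

```python
import bisect


class SSTable:
    """Immutable sorted key-value pairs, split into blocks with a small index."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        # Tuples keep the table read-only once it has been built.
        self.blocks = tuple(
            tuple(pairs[i:i + block_size]) for i in range(0, len(pairs), block_size)
        )
        # The index holds the first key of each block.
        self.index = tuple(block[0][0] for block in self.blocks)

    def get(self, key):
        """Binary-search the index, then look inside a single block."""
        position = bisect.bisect_right(self.index, key) - 1
        if position < 0:
            return None  # key sorts before the first block
        for stored_key, value in self.blocks[position]:
            if stored_key == key:
                return value
        return None


sstable = SSTable({"b": "2", "a": "1", "d": "4", "c": "3", "e": "5"})
```

Because the index is small, it can stay in memory while the blocks stay on disk, so a lookup reads at most one block.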

31 Google Bigtable building blocks - Tablet
• Contains some range of rows from the table
• Rows are sorted lexicographically
• Built from multiple SSTables
• Compressed after reaching a certain size
• CF data is compressed together (locality)
• Tablet size is chosen to be ~200 MB for GFS reasons

32 Google Bigtable building blocks - Table
• Multiple tablets make up the table
• SSTables can be shared
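
Since each tablet holds a contiguous, lexicographically sorted range of rows, locating the tablet for a row is a binary search over the split points. The sketch below assumes hypothetical split keys and plain dicts as tablets; it does not model Bigtable's actual tablet-location hierarchy.

```python
import bisect


class Table:
    """Toy table split into tablets by row-key range."""

    def __init__(self, split_keys):
        # Tablet i holds rows with split_keys[i-1] <= row < split_keys[i];
        # the first tablet is open at the bottom, the last one at the top.
        self.split_keys = sorted(split_keys)
        self.tablets = [dict() for _ in range(len(self.split_keys) + 1)]

    def tablet_index(self, row):
        """Binary-search the split points to find the tablet owning the row."""
        return bisect.bisect_right(self.split_keys, row)

    def put(self, row, value):
        self.tablets[self.tablet_index(row)][row] = value

    def get(self, row):
        return self.tablets[self.tablet_index(row)].get(row)


users = Table(split_keys=["g", "n"])  # three tablets: [..g), [g..n), [n..]
for name in ["ana", "boris", "georgi", "maria", "nikola", "zara"]:
    users.put(name, name.upper())
```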

33 Finding a tablet

34 Google Bigtable implementation – Three building blocks
• Client library
  • Caches tablet locations for performance
• Master server (exactly one; kills itself in some scenarios)
  • Assigns tablets to tablet servers
  • Detects addition/expiration of tablet servers (Chubby)
  • Tablet-server load balancing
  • Garbage collection of GFS files
  • Schema changes such as table and column family creation
• Tablet servers (~thousands, dynamically added/removed)
  • Handle read/write requests to the tablets they have loaded
  • Split/compress tablets that have grown too large

35 Adapting RDBMS systems to NoSQL usage
• MySQL
  • NDB cluster, NDB API
  • HandlerSocket plugin
  • http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

36 NoSQL advantages/disadvantages
• Pros
  • Open source
  • Some serious and innovative science behind them
  • Massive data stores, horizontally scalable
  • Great fit for many “Web 2.0” services and modern applications
• Cons
  • They are no panacea!
  • No standards; different DBs are suitable for different tasks (DS DBs)
  • Limited query capabilities
  • We do not think non-relationally … yet (a mindset change is needed)
  • Some are still experimental (alpha)
  • Most production DB systems are still relational
• A hybrid solution is probably the best

37 END OF LECTURE #7

