MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …


1 www.decideo.fr/bruley MapReduce michel.bruley@teradata.com April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …

2 What is MapReduce?
- Restricted parallel programming model meant for large clusters
  - User implements Map() and Reduce() functions
- Parallel computing framework
  - Libraries take care of EVERYTHING else
    - Parallelization
    - Fault tolerance
    - Data distribution
    - Load balancing
- Useful model for many practical tasks

3 Map and Reduce
- The idea of Map and Reduce is 40+ years old
  - Present in all functional programming languages
  - See, e.g., APL, Lisp and ML
- Alternate name for Map: Apply-All
- Higher-order functions
  - take function definitions as arguments, or
  - return a function as output
- Map and Reduce are higher-order functions.
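As a small illustration of the higher-order functions described on this slide, here is a sketch using Python's built-in `map` and `functools.reduce`. The helper names `apply_all` and `make_adder` are made up for this example and do not come from the presentation.

```python
from functools import reduce


def apply_all(fn, items):
    """Map ("Apply-All"): call fn on every element and collect the results."""
    return list(map(fn, items))


def make_adder(n):
    """A higher-order function that returns a function as output."""
    return lambda x: x + n


# Map takes a function definition as an argument.
squares = apply_all(lambda x: x * x, [1, 2, 3, 4])

# Reduce folds a sequence into one value using a two-argument function.
total = reduce(lambda acc, x: acc + x, squares, 0)

# A function returned by another function can itself be passed to Map.
shifted = apply_all(make_adder(10), [1, 2, 3])
```

Here `squares` is `[1, 4, 9, 16]`, `total` is `30`, and `shifted` is `[11, 12, 13]`.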

4 Map and Reduce Functions
- Functions borrowed from functional programming languages (e.g. Lisp)
- Map()
  - Processes a key/value pair to generate intermediate key/value pairs
- Reduce()
  - Merges all intermediate values associated with the same key
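The contract between the two functions can be sketched with a tiny single-process driver. This is only an illustration under simplifying assumptions (everything fits in memory, no parallelism); `run_map_reduce`, `index_map` and `index_reduce` are hypothetical names, and the keyword-index job is just one of the example uses mentioned later in the deck.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each input pair, group by intermediate key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in records:
        # Map: one input key/value pair -> zero or more intermediate pairs.
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce: merge all intermediate values that share the same key.
    return {k: reduce_fn(k, values) for k, values in intermediate.items()}


def index_map(doc_id, text):
    """Emit (word, doc_id) for every word in the document."""
    for word in text.split():
        yield (word, doc_id)


def index_reduce(word, doc_ids):
    """Merge the document ids seen for one word into a sorted, de-duplicated list."""
    return sorted(set(doc_ids))


docs = [("d1", "map reduce"), ("d2", "reduce")]
index = run_map_reduce(docs, index_map, index_reduce)
```

With this input, `index` maps `"map"` to `["d1"]` and `"reduce"` to `["d1", "d2"]`.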

5 Example: Counting Words
- Map()
  - Input
  - Parses file and emits pairs, e.g.
- Reduce()
  - Sums all values for the same key and emits, e.g. =>
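A minimal sketch of the word-count example, assuming the input is a mapping from file name to file text and that Map emits one `(word, 1)` pair per word. The function names, the lowercasing step and the sample files are assumptions made for illustration, not details taken from the slide.

```python
from itertools import groupby
from operator import itemgetter


def map_word_count(filename, text):
    """Map: parse one file and emit a (word, 1) pair for every word."""
    for word in text.split():
        yield (word.lower(), 1)


def reduce_word_count(word, counts):
    """Reduce: sum all values emitted for the same key."""
    return (word, sum(counts))


def word_count(files):
    """Run Map over every file, sort the pairs by key, then Reduce each group."""
    pairs = [p for name, text in files.items() for p in map_word_count(name, text)]
    pairs.sort(key=itemgetter(0))
    return dict(
        reduce_word_count(word, [count for _, count in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )


counts = word_count({"a.txt": "hello world", "b.txt": "Hello again"})
```

For these two hypothetical files, `counts` is `{"again": 1, "hello": 2, "world": 1}`.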

6 Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
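The seven steps can be imitated on one machine. The sketch below is a toy under stated assumptions: a thread pool stands in for the workers, in-memory lists stand in for files on disk, keys are strings, and a CRC32 hash picks the region. None of the function names come from the presentation or from any real MapReduce library.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def split_input(records, m):
    """Step 1: cut the input into M splits."""
    return [records[i::m] for i in range(m)]


def run_map_task(split, map_fn, r):
    """Steps 3-4: run map_fn over one split and place its output in R regions."""
    regions = [[] for _ in range(r)]
    for record in split:
        for key, value in map_fn(record):
            regions[zlib.crc32(key.encode("utf-8")) % r].append((key, value))
    return regions


def run_reduce_task(region, map_outputs, reduce_fn):
    """Steps 5-6: read one region from every map output, sort by key, reduce."""
    pairs = [p for regions in map_outputs for p in regions[region]]
    pairs.sort(key=itemgetter(0))
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def run_job(records, map_fn, reduce_fn, m=3, r=2, workers=4):
    splits = split_input(records, m)
    # Step 2: the calling thread plays the master, the pool plays the workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(lambda s: run_map_task(s, map_fn, r), splits))
        outputs = list(
            pool.map(lambda i: run_reduce_task(i, map_outputs, reduce_fn), range(r))
        )
    # Step 7: return the combined output of all reduce tasks.
    result = {}
    for output in outputs:
        result.update(output)
    return result


lines = ["a b a", "b c", "a"]
job_counts = run_job(
    lines,
    lambda line: [(word, 1) for word in line.split()],
    lambda key, values: sum(values),
)
```

Because every occurrence of a key lands in the same region, merging the reduce outputs never overwrites a result; here `job_counts` is `{"a": 3, "b": 2, "c": 1}`.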

7 Map/Reduce Cluster Implementation
[Diagram: input files (split 0 to split 4) feed M map tasks, which write intermediate files that are read by R reduce tasks, which write the output files (Output 0, Output 1)]
- Several map or reduce tasks can run on a single computer
- Each intermediate file is divided into R partitions by a partitioning function
- Each reduce task corresponds to one partition
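A common choice of partitioning function is a hash of the key modulo R; the slide does not say which function is used, so that choice is an assumption here. The sketch uses `zlib.crc32` because Python's built-in `hash` for strings varies between processes; `partition` and `partition_map_output` are illustrative names.

```python
import zlib


def partition(key, r):
    """Return the partition index in [0, r) for a string key."""
    return zlib.crc32(key.encode("utf-8")) % r


def partition_map_output(pairs, r):
    """Divide one map task's intermediate pairs into R partitions."""
    partitions = [[] for _ in range(r)]
    for key, value in pairs:
        partitions[partition(key, r)].append((key, value))
    return partitions


# Reduce task i would read partitions[i] from every map task's output.
partitions = partition_map_output([("a", 1), ("b", 1), ("a", 1)], r=2)
```

The property that matters is that equal keys always map to the same partition, so a single reduce task sees every value for a given key.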

8 Map Reduce vs. Parallel Databases
- Map Reduce is widely used for parallel processing
  - Google, Yahoo, and 100s of other companies
  - Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, …
- Database people say:
  - but parallel databases have been doing this for decades
- Map Reduce people say:
  - we operate at scales of 1000s of machines
  - we handle failures seamlessly
  - we allow procedural code in map and reduce and allow data of any type

9 Typical MapReduce Cluster

10 Map Reduce Implementations
- Google
  - Not available outside Google
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
  - Download: http://lucene.apache.org/hadoop/
- Teradata Aster
  - Cluster-optimized SQL database that also implements MapReduce
  - IITB alumnus among founders
- And several others, such as Cassandra at Facebook, etc.

11 MapReduce v. Hadoop
                     | MapReduce | Hadoop
Org                  | Google    | Yahoo/Apache
Impl                 | C++       | Java
Distributed File Sys | GFS       | HDFS
Data Base            | Bigtable  | HBase
Distributed lock mgr | Chubby    | ZooKeeper

12 Solutions Stack for Teradata Aster
[Diagram labels: Aster Data nCluster, Business Intelligence Tools, Analytics Specialists, Data Integration / ETL, Systems Management, Security, Query Tools, Servers, Operating System, Cloud Infrastructure, Aster Data Ecosystem, Aster Data Platform, Infrastructure, Storage]

13 Teradata Aster Platform Infrastructure
For physical infrastructure (non-cloud) deployments
- Server Hardware: certified commodity (x86) server hardware with internal storage
- Operating System: certified Linux operating system
- Aster Data Analytic Platform: Aster Data nCluster packaged software (nCluster)

14 Teradata Aster Infrastructure
For cloud deployments
- Compute Instance: compute instance from cloud provider (e.g. Amazon Web Services EC2) (CC, xLarge)
- Storage: storage connected to cloud computing capacity (EBS, Ephemeral)
- Operating System: Linux operating system
- Aster Data Analytic Platform: Aster Data nCluster packaged software (nCluster)

15 Teradata Aster Architecture for Analytics
[Diagram labels: Your Analytics & Advanced Reporting Applications; Aster Data nCluster; Unified Interface (SQL, SQL-MapReduce); Analytic Functions and Frameworks; Analytics Processing Engines (App, SQL, MapReduce, …); Massively Parallel Data Stores]
- Hybrid row/column DBMS
- Linear, incremental scalability
- Commodity hardware
- Standard SQL interface
- MapReduce processing integrated with SQL via SQL-MapReduce interface
- Rich libraries of MapReduce analytics from Aster Data and partners
- Visual development environment: develop in hours
- Optimized SQL engine
- Fully integrated in-database MapReduce
- Support for in-database processing of custom applications written in a broad variety of languages
- Integration with third-party packaged software via ODBC/JDBC or in-database integration

16 Teradata Aster Ecosystem
Partner | Product | Product release | Platform for Certification
MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
Informatica | Powercenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM | Cognos | 10.1FP1 | n/a
Tableau | Tableau Server | 6 | Windows (SS: TBU)
Microsoft | SSLS, SSAS, SSFS, SSIS | SQL Server 2008 | .NET Framework 2.0 Windows Server, 2008 64-bit Windows 2003, 32-bit
*Oracle BIEE certification currently in process

