Big Data Analytics Map-Reduce and Spark


1 Big Data Analytics Map-Reduce and Spark
Slides by: Joseph E. Gonzalez, with revisions by: Josh Hug

2 From SQL to Big Data (with SQL)
Last week… (relational) databases, database management systems, and SQL: Structured Query Language.
Today: more on databases and database design; enterprise data management and the data lake; an introduction to distributed data storage and processing; Spark.
1. Talk more about databases, how we design data models, etc., and cover how real enterprises think about data management, both traditionally and in a more modern sense. 2. Discuss how modern systems accomplish distribution. 3. Spark.

3 Data in the Organization
A little bit of buzzword bingo! Operational Data Store, Data Warehouse, ETL (Extract, Transform, Load), Star Schema, Snowflake Schema, OLAP (Online Analytics Processing), CUBE, ROLLUP, Drill Down, Data Lake, Schema on Read.

4 How we like to think of data in the organization
A single, unified database (e.g., Inventory).

5 The reality… many separate operational systems: Sales (Asia), Sales (US), Inventory, Advertising.

6 Operational Data Stores
Operational data stores capture the now: many different databases across an organization, e.g., Sales (Asia), Sales (US), Inventory, Advertising. They serve live, ongoing business operations (e.g., managing inventory) and are mission critical… be careful! They have different formats (e.g., currency) and different schemas (acquisitions ...), and live systems often don't maintain history.
The data warehouse collects and organizes historical data from multiple sources.
Why do we want history? What was the balance of my account last week? We don't have that; only the current state. What if you want analysis over time? We would like a consolidated, clean, historical snapshot of the data.

7 Data Warehouse
Collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising). Data is periodically ETLed into the data warehouse: Extracted from remote sources, Transformed to standard schemas, and Loaded into the (typically) relational (SQL) data system.

8 Extract → Transform → Load (ETL)
Extract & Load: provides a snapshot of operational data. A historical snapshot, with data in a single system, isolates analytics queries (e.g., deep learning) from business-critical services (e.g., processing user purchases). Easy!
Transform: clean and prepare data for analytics in a unified representation. Difficult: often requires specialized code and tools to handle different schemas, encodings, and granularities.
What could go wrong here? Don't want to drop data silently. The system goes down when you make the request. The input data format changes. (This happens!)
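As a concrete illustration, here is a minimal ETL sketch in Python with pandas. The file names, column names, exchange rate, and the SQLite "warehouse" are hypothetical stand-ins, not part of the slides.

import sqlite3
import pandas as pd

# Extract: pull snapshots from two hypothetical operational exports.
sales_us = pd.read_csv("sales_us.csv")      # columns: item, price_usd, date
sales_asia = pd.read_csv("sales_asia.csv")  # columns: item, price_krw, sale_date

# Transform: unify schemas, encodings, and granularities.
sales_asia = sales_asia.rename(columns={"sale_date": "date"})
sales_asia["price_usd"] = sales_asia["price_krw"] / 1300.0   # assumed fixed FX rate
sales_asia = sales_asia.drop(columns=["price_krw"])
unified = pd.concat([sales_us, sales_asia], ignore_index=True)
unified["date"] = pd.to_datetime(unified["date"])

# Load: append the cleaned snapshot into the warehouse table.
con = sqlite3.connect("warehouse.db")
unified.to_sql("sales_fact", con, if_exists="append", index=False)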

9 Data Warehouse
Collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising) via ETL. How is the data organized in the data warehouse?

10 Example Sales Data Big table: many columns and rows
A wide, denormalized table with columns pname, category, price, qty, date, day, city, state, and country: the same product attributes (e.g., Corn / Food / 25, Galaxy / Phones / 18, Peanuts / 2), dates (3/30/16 Wed. through 4/1/16 Fri.), and locations (Omaha, NE, USA; Seoul, Korea) are repeated across many sales rows.
Substantial redundancy: expensive to store and access, and easy to make mistakes while updating. Could we organize the data more efficiently?
Why would we want to normalize? Say we wanted to change the price of an item: SQL queries can get really messy and expensive when you have a lot of repeated data.

11 Multidimensional Data Model
Multidimensional "cube" of data, organized as a Sales fact table plus dimension tables:
Sales fact table: (pid, timeid, locid, sales), one row per product/time/location combination.
Products dimension: (pid, pname, category, price), e.g., (11, Corn, Food, 25), (12, Galaxy 1, Phones, 18), (13, Peanuts, …, 2).
Locations dimension: (locid, city, state, country), e.g., (1, Omaha, Nebraska, USA), (2, Seoul, …, Korea), (5, Richmond, Virginia, …).
Time dimension: (timeid, date, day), e.g., (1, 3/30/16, Wed.), (2, 3/31/16, Thu.), (3, 4/1/16, Fri.).

12 The Star Schema
A collection of tables with the Sales fact table (pid, timeid, locid, sales) at the center, connected by key relationships to the dimension tables: Products (pid, pname, category, price), Time (timeid, date, day), and Locations (locid, city, state, country). This looks like a star …
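A small pandas sketch of querying a star schema: the table and column names follow the slides, but the rows are made-up examples. A typical warehouse query joins the narrow fact table out to its dimension tables, then aggregates.

import pandas as pd

# Dimension tables (illustrative rows).
products = pd.DataFrame({"pid": [11, 12], "pname": ["Corn", "Galaxy 1"],
                         "category": ["Food", "Phones"], "price": [25, 18]})
locations = pd.DataFrame({"locid": [1, 2], "city": ["Omaha", "Seoul"],
                          "country": ["USA", "Korea"]})

# Fact table: one narrow row per sale, keyed by dimension ids.
sales = pd.DataFrame({"pid": [11, 11, 12], "locid": [1, 2, 1], "sales": [25, 8, 30]})

# Join fact -> dimensions, then aggregate: total sales by category and country.
joined = sales.merge(products, on="pid").merge(locations, on="locid")
print(joined.groupby(["category", "country"])["sales"].sum())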

13 The Star Schema  This looks like a star …

14 The Star Schema  This looks like a star …

15 The Snowflake Schema  This looks like a snowflake …
The dimension tables become fact tables themselves => snowflake schemas. All of these are mentioned very commonly in data warehouses. See CS 186 for more!

16 Online Analytics Processing (OLAP)
Users interact with multidimensional data by constructing ad-hoc and often complex SQL queries, or by using graphical tools to construct queries (e.g., Tableau).
Okay, so now we understand multidimensional data models. How do users interact with this data? Let's discuss some common types of queries used in OLAP.

17 Cross Tabulation (Pivot Tables)
Raw data (Item, Color, Quantity): (Desk, Blue, 2), (Desk, Red, 3), (Sofa, Blue, 4), (Sofa, Red, 5).
Cross tabulation, aggregating quantity across the Item and Color dimensions:
Color \ Item   Desk   Sofa   Sum
Blue            2      4      6
Red             3      5      8
Sum             5      9     14
Aggregate data across pairs of dimensions. Pivot tables: a graphical interface to select the dimensions and an aggregation function (e.g., SUM, MAX, MEAN), implemented with GROUP BY queries. Related to contingency tables and marginalization in statistics: this is the same as computing the marginal distributions over the rows and columns. How does this generalize to many dimensions?
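A minimal pandas sketch of the same cross tabulation; the furniture rows come from the slide, and pivot_table with margins=True reproduces the row, column, and grand totals.

import pandas as pd

furniture = pd.DataFrame({
    "Item":     ["Desk", "Desk", "Sofa", "Sofa"],
    "Color":    ["Blue", "Red",  "Blue", "Red"],
    "Quantity": [2, 3, 4, 5],
})

# Color down the rows, Item across the columns, summing Quantity;
# margins adds the row, column, and grand totals.
pivot = furniture.pivot_table(index="Color", columns="Item", values="Quantity",
                              aggfunc="sum", margins=True, margins_name="Sum")
print(pivot)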

18 Cube Operator
Generalizes cross-tabulation to higher dimensions (e.g., Product Id × Location Id × Time Id), computing aggregates for every combination of dimensions, with * marking "all values" along a dimension.
In SQL:
SELECT Item, Color, SUM(Quantity) AS QtySum
FROM Furniture
GROUP BY CUBE (Item, Color);
Input (Item, Color, Quantity): (Desk, Blue, 2), (Desk, Red, 3), (Sofa, Blue, 4), (Sofa, Red, 5).
Output (Item, Color, QtySum): (Desk, Blue, 2), (Desk, Red, 3), (Desk, *, 5), (Sofa, Blue, 4), (Sofa, Red, 5), (Sofa, *, 9), (*, Blue, 6), (*, Red, 8), (*, *, 14).
For a very large database, this is a very efficient way to compute all of your roll-ups at once.

19 OLAP Queries Slicing: selecting a value for a dimension
Dicing: selecting a range of values in multiple dimensions.
How would we do a slice? A simple WHERE clause. Dicing is a more complex WHERE clause that touches multiple dimensions; a sketch of both follows.
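In pandas terms, slicing and dicing are just filters over the (joined) fact table; a hedged sketch where the rows and the thresholds are made up for illustration:

import pandas as pd

sales = pd.DataFrame({
    "pname": ["Corn", "Corn", "Galaxy 1", "Galaxy 1", "Peanuts"],
    "city":  ["Omaha", "Seoul", "Omaha", "Seoul", "Seoul"],
    "date":  ["3/30/16", "3/30/16", "3/31/16", "4/1/16", "4/1/16"],
    "sales": [25, 8, 30, 20, 50],
})

# Slice: fix one dimension to a single value.
slice_ = sales[sales["date"] == "3/30/16"]

# Dice: restrict several dimensions to sets/ranges of values.
dice = sales[sales["pname"].isin(["Corn", "Galaxy 1"]) &
             sales["city"].isin(["Omaha", "Seoul"]) &
             (sales["sales"] > 10)]
print(slice_, dice, sep="\n\n")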

20 OLAP Queries Rollup: Aggregating along a dimension
Drill-Down: de-aggregating along a dimension.
Roll-up is similar to CUBE: a GROUP BY query where you group along the remaining dimensions. Drill-down de-aggregates the data: imagine we had aggregates per day; drilling down separates each day into finer time slices (e.g., by shift). A small rollup sketch follows.
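A matching pandas sketch of a rollup (again with made-up rows): aggregating away the date dimension leaves totals per product and city.

import pandas as pd

sales = pd.DataFrame({
    "pname": ["Corn", "Corn", "Galaxy 1", "Galaxy 1"],
    "city":  ["Omaha", "Seoul", "Omaha", "Seoul"],
    "date":  ["3/30/16", "3/30/16", "3/31/16", "4/1/16"],
    "sales": [25, 8, 30, 20],
})

# Rollup: group by the remaining dimensions and aggregate away the date.
rollup = sales.groupby(["pname", "city"], as_index=False)["sales"].sum()
print(rollup)

# Drill-down goes the other way: regroup on a finer-grained column
# (e.g., an hour or shift column) if the fact table stores one.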

21 Reporting and Business Intelligence (BI)
Use high-level tools to interact with the data: they automatically generate SQL queries, and the queries can get big! This is common: all of this is often done via BI tools that automatically generate the queries; people usually don't write these queries by hand in practice.

22 Data Warehouse
So far: star schemas, data cubes, OLAP queries, and ETL. The data warehouse collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising), each loaded via its own ETL pipeline.

23 Data Warehouse
Collects and organizes historical data from multiple sources. How do we deal with semi-structured and unstructured data (text/log data, photos & videos)? Do we really want to force a schema on load? In traditional ETL, everything had to conform to a predetermined schema before the load happened; that doesn't necessarily work with this kind of data, especially audio, video, and images.

24 Data Warehouse
Collects and organizes historical data from multiple sources. How do we deal with semi-structured and unstructured data? Do we really want to force a schema on load? One possible schema for images: (iid, date_taken, is_cat, is_grumpy, image_data). It is unclear what a good schema for this image data might look like; something like this will work, but it is inflexible!

25 Data Lake*
Store a copy of all the data in one place in its original "natural" form, and enable data consumers to choose how to transform and use the data: schema on read.
How do we clean and organize this data? How do we load and process this data in a relational system? Depends on use … Can be difficult ... Requires thought ... What could go wrong?
*Still being defined… [Buzzword Disclaimer]

26 The Dark Side of Data Lakes
Cultural shift: from "curate" to "save everything!" Noise begins to dominate signal. Limited data governance and planning. Example: hdfs://important/joseph_big_file3.csv_with_json. What does it contain? When and by whom was it created? No cleaning and verification means lots of dirty data. New tools are more complex and old tools no longer work. Enter the data scientist.

27 A Brighter Future for Data Lakes
Enter the data scientist. Data scientists bring new skills: distributed data processing and cleaning; machine learning, computer vision, and statistical sampling. Technologies are improving: SQL over large files; self-describing file formats (e.g., Parquet) and catalog managers. Organizations are evolving: tracking data usage and file permissions; a new job title, data engineers.

28 How do we store and compute on large unstructured datasets?
Requirements: handle very large files spanning multiple computers; use cheap commodity devices that fail frequently; process distributed data quickly and easily.
Solutions: distributed file systems spread data over multiple machines; assuming machine failure is common leads to redundancy; distributed computing loads and processes files on multiple machines concurrently; a functional programming computational pattern exposes parallelism (a small sketch follows).
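A minimal sketch of that functional pattern in plain Python: because each chunk is handled by a pure map function and the partial results are combined by a commutative, associative reduce, the chunks can be processed by different processes (or machines) independently. The multiprocessing pool and the tiny in-memory chunks are stand-ins for a cluster and real file splits.

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_words(chunk):
    # Map: pure function from one chunk of text to a word-count table.
    return Counter(chunk.split())

def merge(a, b):
    # Reduce: commutative and associative combination of two count tables.
    return a + b

if __name__ == "__main__":
    chunks = ["the cat sat", "the dog sat", "the cat ran"]   # stand-in for file splits
    with Pool(3) as pool:
        partial_counts = pool.map(count_words, chunks)       # map runs in parallel
    total = reduce(merge, partial_counts)
    print(total)   # Counter({'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1})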

29 Distributed File Systems Storing very large files

30 Fault Tolerant Distributed File Systems
Big File: how do we store and access very large files across cheap commodity devices? [Ghemawat et al., SOSP'03]

31 Fault Tolerant Distributed File Systems
The big file is to be stored across four machines (Machine 1 through Machine 4). [Ghemawat et al., SOSP'03]

32 Fault Tolerant Distributed File Systems
Split the big file into smaller parts (A, B, C, D). How? Ideally at record boundaries. What if records are big? [Ghemawat et al., SOSP'03]
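As a toy illustration of splitting at record boundaries, here is a hedged Python sketch that cuts a local text file into roughly equal parts while only breaking at newline boundaries. Real distributed file systems (e.g., GFS/HDFS) use fixed-size blocks and far more machinery; the file name and chunk size below are hypothetical.

def split_at_record_boundaries(path, target_chunk_bytes):
    # Yield chunks of roughly target_chunk_bytes, never splitting a line (record).
    chunk, size = [], 0
    with open(path, "rb") as f:
        for line in f:                       # treat each line as a record
            chunk.append(line)
            size += len(line)
            if size >= target_chunk_bytes:   # close the chunk at a record boundary
                yield b"".join(chunk)
                chunk, size = [], 0
    if chunk:                                # leftover partial chunk
        yield b"".join(chunk)

# Example: split a hypothetical big file into ~64 MB parts.
# parts = list(split_at_record_boundaries("big_file.txt", 64 * 1024 * 1024))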

33 Fault Tolerant Distributed File Systems
Three copies of part A are placed on three different machines; parts B, C, and D remain to be placed. [Ghemawat et al., SOSP'03]

34 Fault Tolerant Distributed File Systems
Part B is likewise replicated onto three machines; parts C and D remain. [Ghemawat et al., SOSP'03]

35 Fault Tolerant Distributed File Systems
Part C is replicated onto three machines; part D remains. [Ghemawat et al., SOSP'03]

36 Fault Tolerant Distributed File Systems
Part D is replicated as well: every part of the big file now lives on three different machines. [Ghemawat et al., SOSP'03]

37 Fault Tolerant Distributed File Systems
Split large files over multiple machines: easily support massive files spanning machines, and read parts of the file in parallel for fast reads of large files. Often built using cheap commodity storage devices. But cheap commodity storage devices will fail! [Ghemawat et al., SOSP'03]

38 Fault Tolerant Distributed File Systems Failure Event
The fully replicated file, just before a failure event. [Ghemawat et al., SOSP'03]

39 Fault Tolerant Distributed File Systems Failure Event
One machine fails, but every block it held is also stored on other machines. [Ghemawat et al., SOSP'03]

40 Fault Tolerant Distributed File Systems Failure Event
A second machine fails, leaving only two machines. [Ghemawat et al., SOSP'03]

41 Fault Tolerant Distributed File Systems Failure Event
Even after the failures, every part of the file (A, B, C, D) is still available on the surviving machines, thanks to replication. [Ghemawat et al., SOSP'03]

42 Map-Reduce Distributed Aggregation
Computing across very large files

43 Interacting With the Data
Two ways to interact with the data:
1. Send the computation to the cluster: submit a query or algorithm, the cluster computes, and a response comes back. Good for bigger datasets and compute-intensive tasks.
2. Bring a sample of the data to you and compute locally. Good for smaller datasets; faster, more natural interaction; lots of tools!

44 How would you compute the number of occurrences of each word in all the books using a team of people?

45 Simple Solution

46 Simple Solution 1) Divide Books Across Individuals

47 Simple Solution 1) Divide Books Across Individuals
2) Compute Counts Locally. Each person produces a local word-count table, e.g., one person counts (Apple: 2, Bird: 7) and another counts (Bird: 1).

48 Simple Solution 1) Divide Books Across Individuals
2) Compute Counts Locally 3) Aggregate Tables. The local tables (Apple: 2, Bird: 7) and (Bird: 1) are combined into a single table: (Apple: 2, Bird: 8).

49 The Map Reduce Abstraction
Map takes a record and emits (key, value) pairs; Reduce takes a key and all of the values emitted for that key and emits an aggregated (key, value) pair.
Example: Word-Count
Map(book):
  for word in set(book):
    emit(word, book.count(word))
Reduce(word, counts):
  sum = 0
  for count in counts:
    sum += count
  emit(word, sum)
[Dean & Ghemawat, OSDI'04]

50 The Map Reduce Abstraction (Simpler)
Example: Word-Count (simpler Map)
Map(book):
  for word in book:
    emit(word, 1)
Reduce(word, counts):
  sum = 0
  for count in counts:
    sum += count
  emit(word, sum)
[Dean & Ghemawat, OSDI'04]
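To make the dataflow concrete, here is a minimal single-machine simulation of this word-count job in Python. The explicit "shuffle" step, grouping emitted pairs by key, is what a real MapReduce system performs between the map and reduce phases; the two tiny "books" are made up.

from collections import defaultdict

def map_fn(book):
    # Map: emit (word, 1) for every word occurrence.
    return [(word, 1) for word in book.split()]

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for one word.
    return (word, sum(counts))

books = ["the cat sat", "the dog sat on the cat"]

# Map phase (could run on many machines in parallel).
emitted = [pair for book in books for pair in map_fn(book)]

# Shuffle: group emitted values by key.
grouped = defaultdict(list)
for word, count in emitted:
    grouped[word].append(count)

# Reduce phase (also parallel, one key per call).
result = dict(reduce_fn(w, cs) for w, cs in grouped.items())
print(result)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'on': 1}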

51 The Map Reduce Abstraction (General)
Example: generalized Map and Reduce
Map(record, f):
  for key in record:
    emit(key, f(key))
Reduce(key, values, f):
  agg = f(values[0], values[1])
  for value in values[2:]:
    agg = f(agg, value)
  emit(key, agg)
Map must be deterministic; Reduce (f) must be commutative and associative.
[Dean & Ghemawat, OSDI'04]

52 Key properties of Map And Reduce
Deterministic Map: allows for re-execution on failure If some computation is lost we can always re-compute Issues with samples? Commutative Reduce: allows for re-order of operations Reduce(A,B) = Reduce(B,A) Example (addition): A + B = B + A Associative Reduce: allows for regrouping of operations Reduce(Reduce(A,B), C) = Reduce(A, Reduce(B,C)) Example (max): max(max(A,B), C) = max(A, max(B,C)) Warning: Floating point operations (e.g. addition) are not guaranteed associative. Remember: we need to be able to deal with failures.
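A quick Python check of that warning: integer max regroups freely, but floating point addition does not, so reductions whose order differs across runs can give slightly different answers on a cluster.

# Integer max is truly associative: regrouping never changes the result.
assert max(max(1, 5), 3) == max(1, max(5, 3))

# Floating point addition is not exactly associative.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 1.0
right = a + (b + c)   # 0.0: the 1.0 is absorbed into -1e16 before it can cancel a
print(left, right)    # different groupings, different answers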

53 Question Suppose our reduction function computes a*b + 1.
Suppose we have three values associated with the key 'cat': 3, 1, and 2. What is the result of the reduction operation? ???

54 Executing Map Reduce: each record flows through the Map function, producing (key, value) pairs.

55 Executing Map Reduce
Distributing the Map function: the file parts (A, B, C, D) are already spread across Machines 1 through 4, and a copy of the Map function is shipped to each machine.

56 Executing Map Reduce
Distributing the Map function: each machine now has its own Map task, ready to run on the file parts it stores locally.

57 Executing Map Reduce
The Map function is applied to the local part of the big file on each machine, and the Map tasks run in parallel, emitting (word, count) pairs such as (apple, 1), (the, 2), (cat, 3), (big, 1), and (dog, 1). The output is cached for fast recovery on node failure.

58 Executing Map Reduce
The emitted (key, value) pairs are routed to Reduce tasks, grouped by key; the Reduce function can be run on many machines …

59 Executing Map Reduce
The Reduce tasks also run in parallel, summing the counts for each key: apple 1, the 3, cat 5, big 2, dog 2.

60 Executing Map Reduce
The reduced results are written to an output file: (apple, 1), (the, 3), (cat, 5), (big, 2), (dog, 2).

61 Executing Map Reduce
If part of the file or any intermediate computation is lost, we can simply recompute it without recomputing everything.

62 Map Reduce Technologies

63 Hadoop
The first open-source MapReduce software, managed by the Apache Foundation and based on Google's Google File System and MapReduce. Companies formed around Hadoop: Cloudera, Hortonworks, MapR.

64 Hadoop
Very active open-source ecosystem with several key technologies: HDFS (Hadoop File System), MapReduce (the map-reduce compute framework), YARN (Yet Another Resource Negotiator), and Hive (SQL queries over MapReduce).
Downside: tedious to use! Joey: the word-count example from before is hundreds of lines of Java code. Key takeaway: this world was painful. Everything was in Java, which means it was very verbose, and it was pretty slow.

65 In-Memory Dataflow System
Developed at the UC Berkeley AMPLab. A follow-on to MapReduce, which writes everything to disk: instead, cache everything in memory! Plus, you don't need to replicate the cached data for fault tolerance; lost results are simply recomputed (lineage-based fault tolerance).
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. HotCloud '10.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.

66 What Is Spark?
A parallel execution engine for big data processing.
General: efficient support for multiple workloads. Easy to use: 2-5x less code than Hadoop MR, with high-level APIs in Python, Java, and Scala. Fast: up to 100x faster than Hadoop MR; can exploit memory when available, with low-overhead scheduling and an optimized engine.

67 Spark Programming Abstraction
Write programs in terms of transformations on distributed datasets. Resilient Distributed Datasets (RDDs): distributed collections of objects that can be stored in memory or on disk, built via parallel transformations (map, filter, …), and automatically rebuilt on failure. Slide provided by M. Zaharia.

68 RDD: Resilient Distributed Datasets
Collections of objects partitioned & distributed across a cluster, stored in RAM or on disk, and resilient to failures. Operations: transformations and actions.

69 Operations on RDDs
Transformations, f(RDD) => RDD: lazy (not computed immediately), e.g., "map", "filter", "groupBy".
Actions: trigger computation, e.g., "count", "collect", "saveAsTextFile".
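A small PySpark sketch of that distinction, assuming a local SparkContext (the job name and the numbers are made up): the transformations only build up a lineage, and nothing executes until the action at the end.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")

nums = sc.parallelize(range(1000000))        # base RDD
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: lazy
squares = evens.map(lambda x: x * x)         # transformation: still lazy

# Nothing has been computed yet; this action triggers the whole pipeline.
print(squares.count())   # 500000

sc.stop()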

70 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Add “variables” to the “functions” in functional programming

71 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Driver Add “variables” to the “functions” in functional programming

72 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://file.txt”) Worker Driver Add “variables” to the “functions” in functional programming

73 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Base RDD lines = spark.textFile(“hdfs://file.txt”) Worker Driver Add “variables” to the “functions” in functional programming

74 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Driver Add “variables” to the “functions” in functional programming

75 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Transformed RDD lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Driver Add “variables” to the “functions” in functional programming

76 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Driver Add “variables” to the “functions” in functional programming messages.filter(lambda s: “mysql” in s).count()

77 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Driver Add “variables” to the “functions” in functional programming Action messages.filter(lambda s: “mysql” in s).count()

78 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Driver Partition 1 Partition 2 Partition 3 Add “variables” to the “functions” in functional programming messages.filter(lambda s: “mysql” in s).count()

79 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
The driver ships tasks to the workers, one task per partition (Partition 1, 2, 3).

80 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
Each worker reads its own HDFS partition.

81 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
Each worker processes its partition and caches the resulting data (Cache 1, Cache 2, Cache 3).

82 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
Each worker sends its results back to the driver.

83 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
A second query is submitted against the cached messages.

84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Cache 1 lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker tasks Partition 1 Partition 2 Partition 3 Driver tasks Add “variables” to the “functions” in functional programming messages.filter(lambda s: “mysql” in s).count() Cache 2 tasks Worker messages.filter(lambda s: “php” in s).count() Cache 3 Worker

85 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
This time each worker processes the data directly from its cache.

86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Cache 1 lines = spark.textFile(“hdfs://file.txt”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker results Partition 1 Partition 2 Partition 3 Driver results Add “variables” to the “functions” in functional programming messages.filter(lambda s: “mysql” in s).count() Cache 2 results Worker messages.filter(lambda s: “php” in s).count() Cache 3 Worker

87 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://file.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
Cache your data for faster results: full-text search of Wikipedia, 60 GB on 20 EC2 machines, 0.5 sec from memory vs. 20 s on disk.

88 Example: Counting Words
moby_dick = spark.textFile("hdfs://books/mobydick.txt")
lines = moby_dick.flatMap(lambda line: line.split(" "))
counts = lines.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.toDF().toPandas()
Much simpler than the Hadoop MapReduce code! The flatMap, map, and reduceByKey calls produce transformed RDDs; computation is lazy. Nothing happens until we get to our action: collecting the counts with toDF().toPandas() kicks off the actual computing.

89 Abstraction: Dataflow Operators
filter groupBy sort union join leftOuterJoin rightOuterJoin map reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...

90 Abstraction: Dataflow Operators
filter groupBy sort union join leftOuterJoin rightOuterJoin map reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...

91 Spark Demo

92 Summary (1/2) ETL is used to bring data from operational data stores into a data warehouse. There are many ways to organize tabular data in a warehouse, e.g., star and snowflake schemas. Online Analytics Processing (OLAP) techniques let us analyze data in the data warehouse; examples: pivot tables, CUBE, slice, dice, rollup, drill down. Unstructured data is hard to store in a tabular format in a way that is amenable to standard techniques (e.g., finding pictures of cats). Resulting new paradigm: the data lake.

93 Summary (2/2) Data Lake is enabled by two key ideas:
Distributed file storage. Distributed computation. Distributed file storage involves replication of data: better speed and reliability, but more costly. Distributed computation is made easier by MapReduce. Hadoop: open-source implementation of distributed file storage and computation. Spark: typically faster and easier to use than Hadoop.

