Big Data Analytics & Spark


Big Data Analytics & Spark. Slides by Joseph E. Gonzalez (jegonzal@cs.berkeley.edu), with revisions by Josh Hug and John DeNero.

Data in the Organization. A little bit of buzzword bingo! Operational Data Store, Data Warehouse, ETL (Extract, Transform, Load), Star Schema, Snowflake Schema, Schema on Read, Data Lake, OLAP (Online Analytics Processing).

How we like to think of data in the organization: a single, unified store (Inventory).

The reality… many separate systems: Sales (Asia), Sales (US), Inventory, Advertising.

Operational Data Stores and the Data Warehouse. Operational data stores capture the now: many different databases across an organization (Sales (Asia), Sales (US), Inventory, Advertising), serving live ongoing business operations such as managing inventory. They are mission critical… be careful! They use different formats (e.g., currency) and different schemas (acquisitions…), and live systems often don't maintain history. Why do we want history? What was the balance of my account last week? We don't have that; only the current state. What if you want analysis over time? We would like a consolidated, clean, historical snapshot of the data. The Data Warehouse collects and organizes historical data from multiple sources.

The Data Warehouse collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising). Data is periodically ETLed into the data warehouse: Extracted from remote sources, Transformed to standard schemas, and Loaded into the (typically) relational (SQL) data system.

Extract, Transform, Load (ETL). Extract & Load provides a snapshot of operational data: a historical snapshot, with data in a single system, which isolates analytics queries (e.g., deep learning) from business-critical services (e.g., processing user purchases). Easy! Transform cleans and prepares the data for analytics in a unified representation. Difficult: it often requires specialized code and tools to reconcile different schemas, encodings, and granularities. What could go wrong here? We don't want to drop data silently. The source system may go down when you make the request. The input data format changes. (This happens!)
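
To make the ETL steps concrete, here is a minimal sketch in Python using pandas and sqlite3. The file name sales_asia.csv, the column names, and the currency conversion rate are hypothetical; a real pipeline would add much more validation and error handling so that data is never dropped silently.

import pandas as pd
import sqlite3

# Extract: snapshot the operational data (hypothetical export file).
raw = pd.read_csv("sales_asia.csv")

# Transform: standardize the schema and units across sources.
clean = raw.rename(columns={"prod_name": "pname", "amount_krw": "price"})
clean["price"] = clean["price"] * 0.00075          # hypothetical KRW -> USD rate
clean["date"] = pd.to_datetime(clean["date"])      # unify date encodings

# Basic check so bad input fails loudly instead of being dropped silently.
assert clean["price"].notna().all(), "missing prices in extracted data"

# Load: append the snapshot into the (relational) warehouse.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="append", index=False)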

How is data organized in the Data Warehouse? Recall that the warehouse collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising), each feeding in via ETL.

Example Sales Data. A big table with many columns (pname, category, price, qty, date, day, city, state, country) and many rows, e.g., rows for Corn (Food, price 25, sold in Omaha, NE, USA on 3/30/16, 3/31/16, and 4/1/16), Galaxy (Phones, price 18), and Peanuts (price 2, sold in Seoul, Korea). There is substantial redundancy, which makes the table expensive to store and access and makes it easy to make mistakes while updating. Could we organize the data more efficiently? Why would we want to normalize? Suppose we wanted to change the price of an item… SQL queries can get really messy and expensive when you have a lot of repeated data.
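
As a sketch of why repeated data is painful, the snippet below builds a tiny denormalized table (the values are illustrative, not the full table from the slide) and shows that changing one product's price means touching many rows, whereas a separate Products table would need only a single update.

import pandas as pd

# Denormalized: product attributes repeated on every sales row (illustrative values).
big = pd.DataFrame({
    "pname":    ["Corn", "Corn", "Corn", "Galaxy 1"],
    "category": ["Food", "Food", "Food", "Phones"],
    "price":    [25, 25, 25, 18],
    "qty":      [25, 8, 15, 30],
})

# Changing the price of Corn requires updating every matching row.
big.loc[big["pname"] == "Corn", "price"] = 30
print((big["pname"] == "Corn").sum(), "rows updated")   # 3 rows

# Normalized: the price lives in one place, so only one row changes.
products = pd.DataFrame({"pname": ["Corn", "Galaxy 1"], "price": [25, 18]})
products.loc[products["pname"] == "Corn", "price"] = 30  # 1 row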

Multidimensional Data Model. The data forms a multidimensional "cube" indexed by Product Id, Location Id, and Time Id. It is stored as a Sales Fact Table (pid, timeid, locid, sales) together with dimension tables:
Products (pid, pname, category, price): (11, Corn, Food, 25), (12, Galaxy 1, Phones, 18), (13, Peanuts, …, 2).
Locations (locid, city, state, country): (1, Omaha, Nebraska, USA), (2, Seoul, …, Korea), (5, Richmond, Virginia, …).
Time (timeid, date, day): (1, 3/30/16, Wed.), (2, 3/31/16, Thu.), (3, 4/1/16, Fri.).

Multidimensional Data Model: the normalized representation. The Fact Table minimizes redundant info and reduces data errors. The Dimension tables are easy to manage and summarize; a rename (e.g., Galaxy 1 to Phablet) touches a single row. How do we do analysis? Joins! Note that normalization here does not refer to statistical normalization; if you want to learn more, take CS 186. There are many forms of normalization, but we won't cover that here. (Speaker note: hidden since I talked about this in the Advanced SQL lecture. Someone put it there in Sp18, but I think it should be in this lecture instead.)
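
Here is a small sketch of "analysis via joins" using pandas. The dimension rows follow the slide, but the fact-table rows are made up for illustration, since the slide's fact table is not fully legible.

import pandas as pd

# Dimension tables (following the slide).
products = pd.DataFrame({"pid": [11, 12, 13],
                         "pname": ["Corn", "Galaxy 1", "Peanuts"],
                         "price": [25, 18, 2]})
locations = pd.DataFrame({"locid": [1, 2, 5],
                          "city": ["Omaha", "Seoul", "Richmond"],
                          "country": ["USA", "Korea", "USA"]})

# Fact table (illustrative rows: pid, timeid, locid, sales).
sales = pd.DataFrame({"pid":    [11, 11, 12, 13],
                      "timeid": [1, 2, 1, 3],
                      "locid":  [1, 1, 1, 2],
                      "sales":  [25, 8, 30, 10]})

# Analysis = joins: total revenue per country.
joined = (sales.merge(products, on="pid")
               .merge(locations, on="locid"))
joined["revenue"] = joined["sales"] * joined["price"]
print(joined.groupby("country")["revenue"].sum())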

The Star Schema. A collection of tables with the Sales Fact Table (pid, timeid, locid, sales) at the center and a variety of key relationships to the dimension tables: Products (pid, pname, category, price), Time (timeid, date, day), and Locations (locid, city, state, country). This looks like a star…

The Snowflake Schema. This looks like a snowflake… The dimension tables are normalized further, becoming the center of their own dimension tables, which gives a snowflake schema. All of these are mentioned very commonly in data warehouses. See CS 186 for more!

Online Analytics Processing (OLAP). Users interact with multidimensional data by constructing ad-hoc and often complex SQL queries, by using graphical tools to construct queries, and by sharing views that summarize data across important dimensions. Okay, so now we understand multidimensional data models; how do users interact with this data? (Speaker note: Joey's original version of this slide mentions views, but they aren't covered in the previous lecture, nor in this part of the lecture other than this bullet point. I've effectively removed it by hiding this version of the slide to keep the narrative streamlined. -Josh)
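
An ad-hoc OLAP-style query typically joins the fact table to a dimension and aggregates along it. As a sketch, using an in-memory SQLite database with made-up rows (the table and column names mirror the earlier slides, the values are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"pid": [11, 11, 12], "locid": [1, 1, 2], "sales": [25, 8, 30]}
            ).to_sql("sales", conn, index=False)
pd.DataFrame({"locid": [1, 2], "country": ["USA", "Korea"]}
            ).to_sql("locations", conn, index=False)

# An ad-hoc aggregation across the location dimension.
query = """
    SELECT l.country, SUM(s.sales) AS total_sales
    FROM sales s JOIN locations l ON s.locid = l.locid
    GROUP BY l.country
"""
print(pd.read_sql_query(query, conn))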

Reporting and Business Intelligence (BI). Analysts use high-level tools to interact with their data; the tools automatically generate SQL queries, and those queries can get big! This is common: in practice, BI tools generate these queries automatically, and people usually don't write them by hand.

The Data Warehouse, so far: Star Schemas, data cubes, OLAP, and ETL. The warehouse collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising), each loaded via ETL.

But how do we deal with semi-structured and unstructured data, such as Text/Log Data and Photos & Videos? Do we really want to force a schema on load? In traditional ETL, everything had to conform to a predetermined schema before the load happened… That doesn't necessarily work with this kind of data, especially audio, video, and images. It is terrible!

For example, we could force images into a warehouse table with columns (iid, date_taken, is_cat, is_grumpy, image_data) and rows like (451231333, 01-22-2016, 1, …, …), (472342122, 06-17-2017, …), (571827231, 03-15-2009, …), (238472733, 05-18-2018, …). How do we deal with semi-structured and unstructured data? Do we really want to force a schema on load? It is unclear what a good schema for this image data might look like; something like this will work, but it is inflexible!

The Data Lake.* Store a copy of all the data, from every source (Sales (Asia), Sales (US), Inventory, Advertising, Text/Log Data, Photos & Videos), in one place in its original "natural" form, and enable data consumers to choose how to transform and use the data: Schema on Read. How do we clean and organize this data? It depends on the use… How do we load and process this data in a relational system? It depends on the use… It can be difficult and requires thought… What could go wrong? *Still being defined… [Buzzword Disclaimer]
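
A minimal sketch of "schema on read" with PySpark (assuming a local Spark installation; the path and the log fields "level" and "service" are hypothetical): the schema is inferred when the data is read, not enforced when it is stored.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON logs sit in the lake in their natural form (hypothetical path).
logs = spark.read.json("data_lake/raw/app_logs/*.json")

# The schema was inferred at read time, so each consumer can read it differently.
logs.printSchema()
logs.filter(logs.level == "ERROR").groupBy("service").count().show()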

The Dark Side of Data Lakes. A cultural shift from "curate" to "save everything!" means noise begins to dominate signal. There is limited data governance and planning; for example, what does hdfs://important/joseph_big_file3.csv_with_json contain, and when and by whom was it created? No cleaning and verification leads to lots of dirty data, and the new tools are more complex while the old tools no longer work. Enter the data scientist.

A Brighter Future for Data Lakes. Enter the data scientist: data scientists bring new skills, including distributed data processing and cleaning, machine learning, computer vision, and statistical sampling. Technologies are improving: SQL over large files, self-describing file formats (e.g., Parquet), and catalog managers. Organizations are evolving: tracking data usage and file permissions, and a new job title, the data engineer.
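
As a small illustration of a self-describing file format, the sketch below writes and reads a Parquet file with pandas (this assumes the pyarrow or fastparquet package is installed; the column names are made up). Unlike a raw CSV dumped into the lake, the file carries its own column names and types.

import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3],
                   "signup_date": pd.to_datetime(["2016-03-30", "2016-03-31", "2016-04-01"]),
                   "purchases": [3, 0, 7]})

# The Parquet file stores the schema (names and types) alongside the data.
df.to_parquet("users.parquet")

back = pd.read_parquet("users.parquet")
print(back.dtypes)   # types round-trip: int64, datetime64[ns], int64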

How do we store and compute on large unstructured datasets? Requirements: handle very large files spanning multiple computers; use cheap commodity devices that fail frequently; do distributed data processing quickly and easily. Solutions: distributed file systems spread data over multiple machines; assuming machine failure is common leads to redundancy; distributed computing loads and processes files on multiple machines concurrently; and the functional programming computational pattern gives us parallelism, as sketched below.
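
A sketch of how the functional pattern maps to parallelism: if the per-chunk work is a pure function, an executor can apply it to many chunks concurrently and the partial results can be combined afterward. The file names here are hypothetical.

from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def count_words(path):
    """Pure function applied independently to each file chunk."""
    with open(path) as f:
        return Counter(f.read().split())

if __name__ == "__main__":
    chunks = ["part-0.txt", "part-1.txt", "part-2.txt"]   # hypothetical file parts
    with ProcessPoolExecutor() as pool:
        partial_counts = pool.map(count_words, chunks)    # runs in parallel
    total = sum(partial_counts, Counter())                # combine the results
    print(total.most_common(5))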

Distributed File Systems Storing very large files

Fault Tolerant Distributed File Systems. How do we store and access a very large file ("Big File") across cheap commodity devices? [Ghemawat et al., SOSP'03]

Fault Tolerant Distributed File Systems. Split the big file into smaller parts. How? Ideally at record boundaries. What if records are big? [Figure: the file is split into parts A, B, C, and D, and each part is copied onto several of Machine 1 through Machine 4.] [Ghemawat et al., SOSP'03]

Fault Tolerant Distributed File Systems. Split large files over multiple machines: this easily supports massive files spanning machines, and parts of the file can be read in parallel for fast reads of large files. Such systems are often built using cheap commodity storage devices; but cheap commodity storage devices will fail! [Ghemawat et al., SOSP'03]

Fault Tolerant Distributed File Systems: Failure Event. [Figure: two machines fail, but every part A, B, C, and D of the big file still survives on the remaining machines, because each part was replicated.] [Ghemawat et al., SOSP'03]
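
A toy sketch of the idea in pure Python (not how GFS/HDFS is actually implemented): split a file into blocks, place each block on several machines, and check that the file is still recoverable after machines fail, as long as every block survives somewhere.

import itertools

def place_blocks(blocks, machines, replication=3):
    """Assign each block to `replication` distinct machines (round-robin)."""
    placement = {m: [] for m in machines}
    ring = itertools.cycle(machines)
    for block in blocks:
        for _ in range(replication):
            placement[next(ring)].append(block)
    return placement

def recoverable(placement, failed, blocks):
    """The file survives if every block lives on at least one healthy machine."""
    alive = {b for m, bs in placement.items() if m not in failed for b in bs}
    return set(blocks) <= alive

blocks = ["A", "B", "C", "D"]
machines = ["Machine1", "Machine2", "Machine3", "Machine4"]
placement = place_blocks(blocks, machines)

print(placement)
print(recoverable(placement, failed={"Machine1", "Machine3"}, blocks=blocks))   # True
print(recoverable(placement, failed={"Machine1", "Machine2", "Machine3"}, blocks=blocks))  # False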

Distributed Computing

Spark: an In-Memory Dataflow System, developed at the UC Berkeley AMP Lab. It is a follow-on to MapReduce, which writes everything to disk; instead, cache everything in memory! Plus, the cache is not needed for fault tolerance: lost results can be recomputed from their lineage (lineage-based fault tolerance). M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. HotCloud '10. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.

Spark Programming Abstraction. Write programs in terms of transformations on distributed datasets. Resilient Distributed Datasets (RDDs) are distributed collections of objects that can be stored in memory or on disk, are built via parallel transformations (map, filter, …), and are automatically rebuilt on device failure. (Meeting notes, 6/17/14 16:13: this slide doesn't present well. Slide provided by M. Zaharia.)

Operations on RDDs. Transformations, f(RDD) => RDD, are lazy (not computed immediately), e.g., "map", "filter", "groupBy". Actions trigger computation, e.g., "count", "collect", "saveAsTextFile".
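
A short PySpark sketch of the distinction (assuming a local Spark installation): the transformations only build up a plan, and nothing runs until an action is called.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

nums = sc.parallelize(range(1, 1001))            # base RDD

# Transformations: lazy, nothing is computed yet.
evens = nums.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions: trigger the actual computation.
print(squares.count())      # 500
print(squares.take(3))      # [4, 16, 36]

sc.stop()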

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns. (Speaker note: we add "variables" to the "functions" in functional programming.)

lines = spark.textFile("hdfs://file.txt")                  # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()            # action

Nothing happens until the count() action runs. Then the Driver sends tasks to the Workers; each Worker reads its HDFS partition (Partition 1, 2, 3), processes and caches its part of the data (Cache 1, 2, 3), and sends its results back to the Driver. A second query, such as

messages.filter(lambda s: "php" in s).count()

is processed directly from the Workers' caches, so no data is re-read from HDFS. Cache your data for faster results: a full-text search of Wikipedia (60 GB on 20 EC2 machines) took 0.5 sec from memory vs. 20 s on disk.
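
Putting the example together, here is a minimal runnable sketch assuming a local Spark installation. The log file path and its tab-separated format with the message in the third field are assumptions carried over from the slides; the slides call textFile on a handle named spark, while here a SparkContext is created explicitly.

from pyspark import SparkContext

sc = SparkContext("local", "log-mining")

lines = sc.textFile("logs/app.log")                        # base RDD (hypothetical path)
errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformation (lazy)
messages = errors.map(lambda s: s.split("\t")[2])          # third tab-separated field
messages.cache()                                           # keep results in memory

# First action: reads the file, filters, maps, and caches the messages.
print(messages.filter(lambda s: "mysql" in s).count())

# Second action: served from the cached messages, so it runs much faster.
print(messages.filter(lambda s: "php" in s).count())

sc.stop()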

Spark Demo

Summary (1/2). ETL is used to bring data from operational data stores into a data warehouse. There are many ways to organize a tabular data warehouse, e.g., star and snowflake schemas. Online Analytics Processing (OLAP) techniques let us analyze the data in the data warehouse. Unstructured data is hard to store in a tabular format in a way that is amenable to standard techniques, e.g., finding pictures of cats. The resulting new paradigm: the Data Lake. (Lecture outline: 1. Talk more about databases, how we design data models, and how real enterprises think about data management, both traditionally and in a more modern sense. 2. Discuss how modern systems accomplish distribution. 3. Spark.)

Summary (2/2). The Data Lake is enabled by two key ideas: distributed file storage and distributed computation. Distributed file storage involves replication of data: better speed and reliability, but more costly. Distributed computation is made easier by MapReduce. Hadoop is an open-source implementation of distributed file storage and computation; Spark is typically faster and easier to use than Hadoop.