CS639: Data Management for Data Science Lecture 8: Reasoning about Scale & The MapReduce Abstraction Theodoros Rekatsinas

Logistics/Announcements Submission template for PA2 Bonus problem for PA2 Other questions on PA2?

Today’s Lecture Scalability and Algorithmic Complexity Data-Parallel Algorithms The MapReduce Abstraction

1. Scalability and Algorithmic Complexity

What does scalable mean? Operationally: Works even if the data does not fit in main memory Uses all available resources (cores/memory) on a single node (aka scale up) Can make use of 1000s of cheap computers (cloud), elastically (aka scale out) Algorithmically: If you have N data items, you should not perform more than N^m operations for some small constant m (polynomial complexity) In many cases it should be N*log(N) operations (streaming, or data too large to scan repeatedly) If you have N data items and k cores/threads, you should do no more than N^m/k operations

A sketch of algorithmic complexity Example: Find matching string sequences Given a set of string sequences Find all sequences equal to “GATTACGA”

Example: Find matching string sequences GATTACGA

Example: Find matching string sequences GATTACGA Time = 0: Is TACCTGCC = GATTACGA?

Example: Find matching string sequences GATTACGA Time = 0: Is TACCTGCC = GATTACGA? No; move the cursor to the next data entry

Example: Find matching string sequences GATTACGA Time = 1: Is EFTAAGCA = GATTACGA? No; move the cursor to the next data entry

Example: Find matching string sequences GATTACGA Time = 2: Is XXXXXXX = GATTACGA? No; move the cursor to the next data entry

Example: Find matching string sequences GATTACGA Time = n: Is GATTACGA = GATTACGA? Yes! Output the matching sequence

Example: Find matching string sequences GATTACGA If we have 40 records, we need to perform 40 comparisons

Example: Find matching string sequences GATTACGA For N records we perform N comparisons. The algorithmic complexity is order N: O(N)
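
A minimal sketch of this scan in Python (illustrative only; the record list and function name are assumptions, not part of the lecture):

def find_matches(records, target="GATTACGA"):
    # Compare every record against the target: N records, N comparisons -> O(N)
    return [r for r in records if r == target]

print(find_matches(["TACCTGCC", "EFTAAGCA", "GATTACGA"]))  # ['GATTACGA']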

What if we knew the sequences were sorted? Increasing Lexicographic Order GATTACGA

What if we knew the sequences were sorted? Increasing Lexicographic Order GATTACGA Time = 0: Start at the 50% mark. CTGTACA < GATTACGA

What if we knew the sequences were sorted? Increasing Lexicographic Order GATTACGA Time = 1: At the 50% mark, CTGTACA < GATTACGA. Skip to the 75% mark (the target sequence must be in the second half)

What if we knew the sequences were sorted? Increasing Lexicographic Order GATTACGA Time = 2: At the 75% mark, TTGTCCA > GATTACGA. Skip to the 62.5% mark. Match: GATTACGA = GATTACGA. We find our sequence in three steps; from here we can scan matching entries sequentially.

What if we knew the sequences were sorted? Increasing Lexicographic Order GATTACGA How many comparisons? For N records we did log(N) comparisons. The algorithm has complexity O(log(N)): much better scalability.
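
A sketch of the same lookup over a sorted list, halving the search range on each comparison (illustrative Python; in practice the standard bisect module provides this):

def find_sorted(records, target="GATTACGA"):
    # Precondition: records are in increasing lexicographic order
    lo, hi = 0, len(records)
    while lo < hi:                # each iteration halves the range: O(log(N))
        mid = (lo + hi) // 2
        if records[mid] < target:
            lo = mid + 1          # target can only be in the upper half
        else:
            hi = mid              # target is at mid or in the lower half
    if lo < len(records) and records[lo] == target:
        return lo                 # first match; duplicates follow sequentially
    return -1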

2. Data-Parallel Algorithms

New task: Trim string sequences Given a set of string sequences Trim the final n characters of each sequence Generate a new dataset

New task: Trim string sequences (last 3 chars) TACCTGCC Time = 0: TACCTGCC -> TACCT

New task: Trim string sequences (last 3 chars) GATTCTGC Time = 1: GATTCTGC -> GATTC

New task: Trim string sequences (last 3 chars) CCCGAAT Time = 2: CCCGAAT -> CCCG Can we use a data structure to speed up this operation?

New task: Trim string sequences (last 3 chars) CCCGAAT Time = 2: CCCGAAT -> CCCG Can we use a data structure to speed up this operation? No. We have to touch every record! The task is O(N).
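
A one-pass sketch in Python (the records are assumptions for illustration):

def trim(seq, n=3):
    # Drop the final n characters of one sequence
    return seq[:-n]

records = ["TACCTGCC", "GATTCTGC", "CCCGAAT"]
print([trim(r) for r in records])  # ['TACCT', 'GATTC', 'CCCG']; touches all N records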

New task: Trim string sequences (last 3 chars) (Diagram: the sequences are partitioned into k groups, one group per worker)

New task: Trim string sequences (last 3 chars) Time = 1: Process first element of each group

New task: Trim string sequences (last 3 chars) Time = 2: Process second element of each group

New task: Trim string sequences (last 3 chars) Time = 3: Process third element of each group Etc. How much time does this take?

New task: Trim string sequences (last 3 chars) We only need O(N/k) operations where k is the number of groups (workers)
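
A sketch of the grouped version using Python's standard multiprocessing pool (the worker count and records are assumptions; on a real cluster the k groups would sit on different machines):

from multiprocessing import Pool

def trim(seq, n=3):
    # Drop the final n characters of one sequence
    return seq[:-n]

if __name__ == "__main__":
    records = ["TACCTGCC", "GATTCTGC", "CCCGAAT", "GATTACGA"]
    with Pool(processes=4) as pool:     # k = 4 workers
        print(pool.map(trim, records))  # each worker processes ~N/k records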

Schematic of Parallel Algorithms 1. You are given a set of “reads”. You have to process them and generate a “write” 2. You distribute the reads among k computers (workers)

Schematic of Parallel Algorithms 2. You distribute the reads among k computers (workers) 3. Apply function f to each read (for every item in each chunk; each worker runs its own copy of f) 4. Obtain a big distributed set of outputs

Applications of parallel algorithms Convert TIFF images to PNG Run thousands of simulations for different model parameters Find the most common word in each document Compute the word frequency of every word in a single document Etc.

Applications of parallel algorithms Convert TIFF images to PNG Run thousands of simulations for different model parameters Find the most common word in each document Compute the word frequency of every word in a single document Etc. There is a common pattern in all of these applications.

Applications of parallel algorithms A function that maps a string to a trimmed string A function that maps a TIFF image to a PNG image A function that maps a set of parameters to simulation results A function that maps a document to its most common word A function that maps a document to a histogram of word frequencies

Applications of parallel algorithms What if we want to compute the word frequency across all documents?

3. The MapReduce Abstraction

Compute the word frequency across 5M documents

Challenge in this task: How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in? Ideas?

Compute the word frequency across 5M documents A hash function is any function that can be used to map data of arbitrary size to data of a fixed size
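
For example, hashing each word and taking the result modulo the number of machines routes every occurrence of that word to the same machine (a sketch; the worker count is a made-up parameter). Python's built-in hash() is randomized across runs, so a stable hash such as zlib.crc32 is used here:

import zlib

NUM_WORKERS = 50  # hypothetical number of reduce machines

def machine_for(word):
    # Stable hash: maps a word of arbitrary size to a fixed-size integer,
    # so the same word always lands on the same machine
    return zlib.crc32(word.encode("utf-8")) % NUM_WORKERS

print(machine_for("data"))  # same machine id on every call, in every document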

The MapReduce Abstraction for Distributed Algorithms: Distributed Data Storage -> Map -> (Shuffle) -> Reduce
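
The canonical example is word count written as a map function and a reduce function. Below is a minimal single-machine sketch of the abstraction (the function names are illustrative; a real system such as Hadoop runs map tasks and reduce tasks across a cluster and performs the shuffle over the network):

from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for every word in a document
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: combine all intermediate values for one key
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key (the hash-partition step above)
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)
    return [reduce_fn(w, c) for w, c in groups.items()]

print(mapreduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]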