Big Data Reading Group Grigory Yaroslavtsev 361 Levine



Reading group format
Weekly meetings: 3:30pm, Towne 311
Participation-driven format
– Pick a paper to discuss
– Select a volunteer to present
– Participants look at the paper before the meeting
– The volunteer explains technical details and leads the discussion
– More informal than a seminar (presentation not necessary, can use the board, the paper, notes, etc.)

Basics

Part 1: Massive Parallel Computation
– Very large data (graphs)
– Enough space to store the data in a distributed fashion
– Not enough time to compute; communication is a bottleneck

Computational Model S space

Computational Model

MapReduce-style computations

Models of parallel computation
Bulk-Synchronous Parallel Model (BSP) [Valiant '90]
– Pro: Most general, generalizes all other models
– Con: Many parameters, hard to design algorithms
Massive Parallel Computation [Feldman-Muthukrishnan-Sidiropoulos-Stein-Svitkina '07, Karloff-Suri-Vassilvitskii '10, Goodrich-Sitchinava-Zhang '11, ..., Beame-Koutris-Suciu '13, Andoni-Onak-Nikolov-Y. '14]
Pros:
– Inspired by modern systems (Hadoop, MapReduce, Dryad, ...)
– Few parameters, simple to design algorithms
– New algorithmic ideas, robust to the exact model specification
– The number of rounds is an information-theoretic measure, so unconditional lower bounds can be proved
– Between linear sketching and streaming with sorting
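To make the round structure of MapReduce-style models concrete, here is a minimal single-process sketch of one synchronous round (local map, shuffle by key, local reduce), used to compute vertex degrees of a graph whose edges are spread over machines. The names `mpc_round`, `emit_endpoints` and `sum_degrees` are hypothetical, and the per-machine space bound is not enforced; this is an illustration under those assumptions, not the formal model from any of the cited papers.

```python
from collections import defaultdict


def mpc_round(machines, map_fn, reduce_fn, num_machines):
    """Simulate one synchronous round: local map, shuffle by key, local reduce.

    Assumption: the per-machine space bound is NOT checked here.
    """
    # Shuffle: route every emitted (key, value) pair to the machine owning the key.
    inboxes = [defaultdict(list) for _ in range(num_machines)]
    for local_data in machines:
        for item in local_data:
            for key, value in map_fn(item):
                inboxes[hash(key) % num_machines][key].append(value)
    # Reduce: each machine handles the keys it received, independently of the others.
    return [
        [reduce_fn(key, values) for key, values in sorted(inbox.items())]
        for inbox in inboxes
    ]


def emit_endpoints(edge):
    """Map step: an edge contributes 1 to the degree of each endpoint."""
    u, v = edge
    return [(u, 1), (v, 1)]


def sum_degrees(vertex, ones):
    """Reduce step: add up the contributions collected for one vertex."""
    return (vertex, sum(ones))


# Edges of a small graph, split across two machines.
edges_by_machine = [[(1, 2), (2, 3)], [(3, 1), (3, 4)]]
output = mpc_round(edges_by_machine, emit_endpoints, sum_degrees, num_machines=2)
degrees = dict(pair for machine in output for pair in machine)
print(degrees)
```

Degree counting needs only one round; the interesting algorithmic questions in this model are about problems where the number of such rounds has to be kept small.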

Dense graphs vs. sparse graphs

Papers
– Karloff, Suri, Vassilvitskii: A Model of Computation for MapReduce. SODA
– Feldman, Muthukrishnan, Sidiropoulos, Stein, Svitkina: On distributing symmetric streaming computations. SODA
– Lattanzi, Moseley, Suri, Vassilvitskii: Filtering: a method for solving graph problems in MapReduce. SPAA
– Bahmani, Moseley, Vattani, Kumar, Vassilvitskii: Scalable K-Means++. VLDB
– Suri, Vassilvitskii: Counting triangles and the curse of the last reducer. WWW
– Bahmani, Chakrabarti, Xin: Fast personalized PageRank on MapReduce. SIGMOD 2011.

Part 2: Streaming Algorithms
– Very large stream of numbers
– Not enough space even to store them

Data Streams

Problems on Data Streams

Papers
– Cormode, Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004, Imre Simon Award.
– Kane, Nelson, Woodruff: An optimal algorithm for the distinct elements problem. PODS 2010, Best Paper Award.
– Liberty: Simple and deterministic matrix sketching. KDD 2013, Best Paper Award.
– Jha, Seshadhri, Pinar: A space efficient streaming algorithm for triangle counting using the birthday paradox. KDD 2013, Best Student Paper Award.
– Das Sarma, Gollapudi, Panigrahy: Estimating PageRank on graph streams. PODS 2008, Best Paper Award.
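As a taste of the first paper on this list, here is a minimal Count-Min Sketch for estimating item frequencies in a stream using a small table instead of storing the stream. The class name, the linear hash family modulo a Mersenne prime, and the chosen `width` and `depth` are illustrative assumptions; the paper specifies how to set these parameters to obtain its error guarantees.

```python
import random


class CountMinSketch:
    """Frequency estimates from a depth x width table of counters."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1  # Mersenne prime used by the illustrative hash family
        # One (a, b) pair per row defines the hash x -> ((a*x + b) mod prime) mod width.
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        a, b = self.hashes[row]
        return ((a * hash(item) + b) % self.prime) % self.width

    def update(self, item, count=1):
        """Process one stream element: bump one counter in every row."""
        for row in range(len(self.table)):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        """Return the minimum counter over all rows for this item."""
        return min(
            self.table[row][self._index(row, item)]
            for row in range(len(self.table))
        )


stream = [7, 3, 7, 9, 7, 3]
sketch = CountMinSketch(width=16, depth=4)
for x in stream:
    sketch.update(x)
print(sketch.estimate(7))  # never below the true count of 7, which is 3
```

With non-negative updates, collisions can only inflate a counter, so the estimate never undercounts; taking the minimum over rows limits how much it overcounts.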

Thank you!
Next meeting: Friday, September 19, 3:30pm, Towne 311
Links to all papers are available at: