
Welcome to RPI CS! Theory Group

Professors: Mark Goldberg
Associate Professors: Daniel Freedman, Mukkai Krishnamoorthy, Malik Magdon-Ismail, Bulent Yener
Assistant Professors: Elliot Anshelevich, Sanmay Das

Petros Drineas (Assistant Professor)

2 Overview

(My) Research Agenda
- Massive Datasets
- A glimpse of my work
- Future directions (Theory Group?)

Teaching Agenda
- Shameless advertising: we offer all the courses that you need in order to get a solid background in TCS. We use really cool (and sometimes hard) math in our courses!

3 Massive Datasets appear everywhere

Just a few examples:
- The World Wide Web. Too much junk, but Google usually saves the day!
- Document Databases. IBM estimated that its databases contained more than 3 x 10^8 documents.
- AT&T call-detail records. These include the calling number, called number, time of day, and length of call for 260 million calls/day.

4 Queries on Massive Datasets

What kind of queries do we want to perform on Massive Datasets?
- World Wide Web: search for "Jaguar", but separate the web pages that refer to the animal from the ones that refer to the car.
- Document Databases: find documents that are similar to a given document.
- AT&T call-detail records: distinguish business accounts from residential accounts.
- MANY, MANY, MANY MORE EXAMPLES/QUERIES! Definitely more research is necessary.

A quote from Christos Papadimitriou: "CS might not have produced an implant chip that increases our intelligence, but we produced Google, which increases our IQ just as efficiently."

5 Why is it difficult?

Five things to remember about Massive Datasets:
- They are not readily accessible.
- They are stored in secondary memory, and we only have sequential access to the data.
- We do not want to read the data too many times! Equivalently, we want to make a small number of passes over the data.
- Sometimes we are only allowed to make one pass over the data (the streaming model).
- Even worse, in certain applications we are not even allowed to read the full data once (sublinear algorithms).

Imagine sorting 1 trillion numbers stored in secondary memory (no random access there!). Thus, no QuickSort, MergeSort, etc. How do you do it fast? Not surprisingly, it is not easy!

The Massive Dataset paradigm motivates looking for new algorithms, new complexity classes, new lower bounds, new proof techniques, etc.
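To make the one-pass constraint concrete, here is a classic streaming primitive (not from the slides; a standard textbook illustration): reservoir sampling keeps a uniform random sample of k items from a stream it reads exactly once, using only O(k) memory.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream read exactly
    once, using O(k) memory -- the full stream is never stored or revisited."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i (0-based) replaces a kept item with probability k/(i+1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# One sequential pass over a "massive" stream (here just a generator).
sample = reservoir_sample(range(10**6), k=5)
```

Each item survives with probability exactly k/n by the end of the stream, which is the kind of guarantee one-pass algorithms aim for.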

6 My work

In many applications the data appear as large matrices.
- We can make a few "passes" through the matrices.
- We can create and store a small "sketch" of the matrices in RAM.
- Computing the "sketch" should be a very fast process.
- Finally, discard the original matrix and work with the "sketch".

1. A "sketch" consisting of a few rows/columns of the matrix is adequate for efficient approximations.
2. We draw the rows/columns randomly, using adaptive sampling; e.g., we pick rows/columns with larger elements with higher probability.
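A schematic version of this two-pass sketching pattern (dimensions and details are made up for illustration; the real algorithms also rescale the sampled rows):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))   # stand-in for a large matrix on disk

# Pass 1 over the rows: accumulate squared row norms -> sampling probabilities,
# so rows with larger elements are picked with higher probability.
norms = np.zeros(A.shape[0])
for i, row in enumerate(A):           # sequential access only
    norms[i] = row @ row
p = norms / norms.sum()

# Pass 2: keep s sampled rows; this small sketch fits in RAM.
s = 200
idx = np.sort(rng.choice(A.shape[0], size=s, replace=True, p=p))
sketch = A[idx]                       # 200 x 50; the original A can be discarded
```

Two sequential passes suffice: one to compute the sampling distribution, one to draw the sketch.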

7 Applications: Data Mining

We are given m (> 10^6) documents and n (> 10^5) terms describing the documents.

Database: an m-by-n matrix A (A_ij is the frequency of term j in document i). Every row of A represents a document.

Queries: given a new document x, find similar documents in the database (nearest neighbors).

8 Applications (cont'd)

Two documents are "close" if the angle between their corresponding vectors is small. So x^T·d = cos(x,d) is high when the two documents are close. (We assume that the vectors are normalized.) A·x computes all the angles and answers the query.

Key observation: the exact value x^T·d might not be necessary.
1. The feature values in the vectors are set by coarse heuristics.
2. It is in general enough to check whether x^T·d > threshold.
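A toy version of this query (the matrix, query vector, and threshold here are invented for illustration) shows how a single matrix-vector product answers the nearest-neighbor question for every document at once:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy document-term matrix A: A[i, j] = frequency of term j in document i.
A = rng.random((8, 5))
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize each document row

x = rng.random(5)
x = x / np.linalg.norm(x)                          # normalized query document

# One matrix-vector product gives cos(x, d_i) = x^T d_i for all documents.
cosines = A @ x
threshold = 0.9
neighbors = np.where(cosines > threshold)[0]       # documents "close" to x
```

Since only the comparison against the threshold matters, an approximation to A·x with modest error would answer the same query.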

9 Approximating Matrix Multiplication

Goal (the proximity problem): identify all document-document matches in the database. Given an m-by-n matrix A, approximate the product A·A^T.

Idea:
1. Pick s columns of A to form an m-by-s matrix S.
2. (Discard A.) Approximate A·A^T by S·S^T.
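A minimal numerical sketch of this idea, assuming the standard squared-norm sampling probabilities and rescaling (the specific constants and dimensions are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def approx_AAt(A, s):
    """Approximate A @ A.T by sampling s columns of A.
    Columns are drawn with probability proportional to their squared norm
    and rescaled so that E[S @ S.T] = A @ A.T."""
    norms = np.sum(A**2, axis=0)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    S = A[:, idx] / np.sqrt(s * p[idx])
    return S @ S.T          # m-by-m approximation; A can now be discarded

A = rng.standard_normal((30, 2000))
approx = approx_AAt(A, s=400)
exact = A @ A.T
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

Sampling 400 of 2000 columns already gives a usable approximation, and the expected Frobenius-norm error shrinks like 1/sqrt(s).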

10 The algorithm

- It only requires storage of a few columns of A.
- It is faster than full matrix multiplication.
- We can prove error bounds for this algorithm!
- We can use similar ideas to tackle various other matrix operations.

11 Future Directions

Extract structure from data in the form of tensors (multi-dimensional arrays) instead of matrices (two-dimensional arrays). Think of the evolution of a graph (vertices are users and edges denote the exchange of a message) over time. This can be modelled as a three-dimensional array. Can we find communities of users in this time-evolving graph?

Extract non-linear structure from data. A subtle point is that most of the existing techniques only extract linear structure from the data. Can we devise efficient algorithms to extract non-linear structure from massive datasets?

12 Teaching Agenda

Graduate courses offered by Theory Group members in the last few years:
- Topics in Computational Geometry (Freedman)
- Graph Theory (Goldberg)
- Computability and Complexity (Goldberg)
- Machine and Computational Learning (Magdon-Ismail)
- Introduction to Computational Finance (Magdon-Ismail)
- Random Graphs and The WEB-graph (Goldberg, Moorthy, Yener)
- Network Security (Yener)
- Network Flows and Linear Programming (Drineas, Yener)
- Randomized Algorithms (Drineas)
- Machine Learning (Das)
- Advanced Algorithm Design (Anshelevich)

13 Enjoy your (theory) classes!