Mining Massive Datasets: Review

Slides:



Advertisements
Similar presentations
CS525: Special Topics in DBs Large-Scale Data Management
Advertisements

Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
PARTITIONAL CLUSTERING
Mining Data Streams.
Mining of Massive Datasets: Course Introduction
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Link Analysis: PageRank
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Data Mining Sangeeta Devadiga CS 157B, Spring 2007.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Near Duplicate Detection
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Recommender systems Ram Akella November 26 th 2008.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Data Mining – Intro.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Overview of Web Data Mining and Applications Part I
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Walter Hop Web-shop Order Prediction Using Machine Learning Master’s Thesis Computational Economics.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
Data Mining Techniques
Data Mining Chun-Hung Chou
Data mining and machine learning A brief introduction.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
EXAM REVIEW MIS2502 Data Analytics. Exam What Tool to Use? Evaluating Decision Trees Association Rules Clustering.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Data Mining By: Johan Johansson. Mining Techniques Association Rules Association Rules Decision Trees Decision Trees Clustering Clustering Nearest Neighbor.
Mining of Massive Datasets Edited based on Leskovec’s from
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Introduction to Data Mining Mining Association Rules Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
Book web site:
Data Mining: Confluence of Multiple Disciplines Data Mining Database Systems Statistics Other Disciplines Algorithm Machine Learning Visualization.
CSE 4705 Artificial Intelligence
Mining Data Streams (Part 1)
Data Mining – Intro.
Data Mining ICCM
k-Nearest neighbors and decision tree
Near Duplicate Detection
Intro to Machine Learning
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods
Data Mining 101 with Scikit-Learn
Reading: Pedro Domingos: A Few Useful Things to Know about Machine Learning source: /cacm12.pdf reading.
Basic machine learning background with Python scikit-learn
CS341: Project in Mining Massive Datasets Infosession
Waikato Environment for Knowledge Analysis
Community detection in graphs
Mining Data Streams (Part 1)
CPS216: Advanced Database Systems Data Mining
Mining Data Streams (Part 2)
Data Mining Modified from
Sangeeta Devadiga CS 157B, Spring 2007
Q4 : How does Netflix recommend movies?
I don’t need a title slide for a lecture
Artificial Intelligence Lecture No. 28
Topic 5: Cluster Analysis
Presentation transcript:

Mining Massive Datasets: Review CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Data Mining Discovery of patterns and models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern Subsidiary issues: Data cleansing: detection of bogus data Visualization: something better than MBs of output Warehousing of data (for retrieval) 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Mining Massive Datasets Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on scalability of number of features and instances stress on algorithms and architectures automation for handling large data Statistics/ AI Machine Learning/ Pattern Recognition Data Mining Database systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

What We Have Covered Apriori MapReduce Association rules Frequent itemsets PCY Recommender systems PageRank TrustRank HITS SVM Decision Trees Perceptron Web Advertising DGIM method Margin BFR Locality Sensitive Hashing Random hyperplanes MinHash SVD CUR Clustering Matrix factorization Bloom filters Flajolet-Martin method CURE Stochastic Gradient Descent Collaborative Filtering SimRank Spectral clustering AND-OR constructions K-means 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How It All Fits Together Based on different types of data: Data is high dimensional Data is a graph Data is never-ending Data is labeled Based on different models of computation: MapReduce Streams and online algorithms Single machine in-memory 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How It All Fits Together Based on different applications: Recommender systems Association rules Link analysis Duplicate detection Based on different “tools”: Linear algebra (SVD, Rec. Sys., Communities) Optimization (stochastic gradient descent) Dynamic programming (frequent itemsets) Hashing (LSH, Bloom filters) 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How It All Fits Together High dim. data Locality sensitive hashing Clustering Dimensionality reduction Graph data PageRank, SimRank Community Detection Spam Detection Infinite data Filtering data streams Web advertising Queries on streams Machine learning SVM Decision Trees Perceptron, kNN Apps Recommender systems Association Rules Duplicate document detection 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How it all fits together? Data is High-dimensional: Locality Sensitive Hashing Dimensionality reduction Clustering Data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities Data is Labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees Data is infinite: Mining data streams Advertising on the Web Applications: Association Rules Recommender systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

(1) Finding “similar” sets Locality- sensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Minhash- ing Signatures : short integer vectors that represent the sets, and reflect their similarity Shingling Docu- ment The set of strings of length k that appear in the document Shingling: convert docs to sets Minhashing: convert large sets to short signatures, while preserving similarity Locality-sensitive hashing: focus on pairs of signatures likely to be similar 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

(2) Dimensionality Reduction   VT m U 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

(3) Clustering Hierarchical: Point Assignment: Agglomerative (bottom up): Initially, each point is a cluster Repeatedly combine the two “nearest” clusters into one Represent a cluster by its centroid or clustroid Point Assignment: Maintain a set of clusters Points belong to “nearest” cluster 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

High-dim data methods: Comparison LSH: Find somewhat similar pairs of items while avoiding O(N2) comparisons Clustering: Assign points into a prespecified number of clusters Each point belongs to a single cluster Summarize the cluster by a centroid (e.g., topic vector) SVD (dimensionality reduction): Want to explore correlations in the data Some dimensions may be irrelevant Useful for visualization, removing noise from the data, detecting anomalies 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How it all fits together? Data is high-dimensional: Locality Sensitive Hashing Dimensionality reduction Clustering The data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities Data is labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees Data is infinite: Mining data streams Advertising on the Web Applications: Association Rules Recommender systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Link Analysis: PageRank Rank nodes using link structure PageRank: Link voting: P with importance x has n out-links, each link gets x/n votes Page R’s importance is the sum of the votes on its in-links Complications: Spider traps, Dead-ends At each step, random surfer has two options: With probability , follow a link at random With prob. 1-, jump to some page uniformly at random 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

TrustRank & SimRank TrustRank: topic-specific PageRank with a teleport set of “trusted” pages SimRank: measure similarity between items a k-partite graph with k types of nodes Example: picture nodes and tag nodes Perform a random-walk with restarts from node N i.e., teleport set = {N} Resulting prob. distribution measures similarity to N 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Analysis of Large Graphs Detecting clusters of densely connected nodes Trawling: discover complete bipartite subgraphs Frequent itemset mining Graph partitioning: “cut” few edges to separate the graph in two pieces Spectral clustering Graph Laplacian 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How it all fits together? Data is high-dimensional: Locality Sensitive Hashing Dimensionality reduction Clustering The data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities Data is labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees Data is infinite: Mining data streams Advertising on the Web Applications: Association Rules Recommender systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Support Vector Machines Prediction = sign(wx + b) Model parameters w, b Margin: SVM optimization problem: Find w,b using Stochastic gradient descent + + wx+b=0 + - + i + + - i - + + - - - 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Decision Trees: PLANET Building decision trees using MapReduce How to predict? Predictor: avg. yi of the examples in the leaf When to stop? # of examples in the leaf is small How to build? One MapReduce job per level Need to compute split quality for each attribute and each split value for each current leaf A B X1<v1 C D E F G H I |D|=90 |D|=10 X2<v2 X3<v4 X2<v5 |D|=45 .42 |D|=25 |D|=20 |D|=30 |D|=15 FindBestSplit FindBestSplit FindBestSplit 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu FindBestSplit

When to use which method SVM: Classification Millions of numerical features (e.g., documents) Simple (linear) decision boundary Hard to interpret model kNN: Classification or regression (Many) numerical features Many parameters to play with –distance metric, k, weighting, … there is no simple way to set them! Decision Trees: Classification or Regression Relatively few features (handles categorical features) Complicated decision boundary: Overfitting! Easy to explain/interpret the classification Bagged Decision Trees – very very hard to beat! 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How it all fits together? Data is high-dimensional: Locality Sensitive Hashing Dimensionality reduction Clustering The data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities Data is labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees Data is infinite: Mining data streams Advertising on the Web Applications: Association Rules Recommender systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Mining Data Streams Ad-Hoc Queries Standing . . . 1, 5, 2, 7, 0, 9, 3 Processor Standing Queries . . . 1, 5, 2, 7, 0, 9, 3 . . . a, r, v, t, y, h, b . . . 0, 0, 1, 0, 1, 1, 0 time Streams Entering Output Limited Working Storage Archival Storage 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Problems on data streams Sampling data from a stream: Each element is included with prob. k/N Queries over sliding windows: How many 1s are in last k bits? Filtering a stream: Bloom filters Select elements with property x from stream Counting distinct elements: Number of distinct elements in the last k elements of the stream 1001010110001011010101010101011010101010101110101010111010100010110010 Output the item since it may be in S; Filter Item Drop the item hash func h Bit array B 0010001011000 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Online algorithms & Advertising You get to see one input piece at a time, and need to make irrevocable decisions Competitive ratio = minall inputs I (|Mmy_alg|/|Mopt|) Addwords problem: Query arrives to a search engine Several advertisers bid on the query query Pick a subset of advertisers whose ads are shown Greedy online matching: competitive ratio  1/2 1 2 3 4 a b c d Boys Girls 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How it all fits together? Data is high-dimensional: Locality Sensitive Hashing Dimensionality reduction Clustering The data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities Data is labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees Data is infinite: Mining data streams Advertising on the Web Applications: Association Rules Recommender systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Association Rule Discovery Supermarket shelf management – Market-basket model: Goal: To identify items that are bought together by sufficiently many customers Approach: Process the sales data collected with barcode scanners to find dependencies among items Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer} 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Recommender Systems User-user collaborative filtering Consider user c Find set D of other users whose ratings are “similar” to c’s ratings Estimate user’s ratings based on the ratings of users in D Item-item collaborative filtering Estimate rating for item based on ratings for similar items Recommendations Search Items 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Latent Factor Models: Netflix user bias movie bias user-movie interaction Baseline predictor Separates users and movies Benefits from insights into user’s behavior User-Movie interaction Characterizes the matching between users and movies Attracts most research in the field 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

In Closing… http://cs246.stanford.edu CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Map of Superpowers High dim. data Graph data Infinite data Locality sensitive hashing Clustering Dimensionality reduction Graph data PageRank, SimRank Community Detection Spam Detection Infinite data Filtering data streams Web advertising Queries on streams Machine learning SVM Decision Trees Perceptron, kNN Apps Recommender systems Association Rules Duplicate document detection 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

What we’ve learned this quarter MapReduce Association Rules Apriori algorithm Finding Similar Items Locality Sensitive Hashing Random Hyperplanes Dimensionality Reduction Singular Value Decomposition CUR method Clustering Recommender systems Collaborative filtering PageRank and TrustRank Hubs & Authorities k-Nearest Neighbors Perceptron Support Vector Machines Stochastic Gradient Descent Decision Trees Mining data streams Bloom Filters Flajolet-Martin Advertising on the Web 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

How do you want that data? I ♥ data How do you want that data? 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Applying Your Superpowers We are producing more data than we are able to store! [The economist, 2010] 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Applying Your Superpowers 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

THE BIG PICTURE How to analyze large datasets to discover patterns and models that are: valid: hold on new data with some certainty novel: non-obvious to the system useful: should be possible to act on the item understandable: humans should be able to interpret the pattern How to do this using massive data (that does not fit into main memory) 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

What next? Seminars Seminars: Conferences: InfoSeminar: http://i.stanford.edu/infoseminar RAIN Seminar: http://rain.stanford.edu Conferences: KDD: ACM Conference on Knowledge Discovery and Data Mining ICDM: IEEE International Conference on Data Mining WWW: World Wide Web Conference ICML: International Conference on Machine Learning NIPS: Neural Information Processing Systems 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

CS341: Project in Data Mining CS341: Project in Data Mining (Spring 2012) Research project on big data Groups of 3 students We provide interesting data, computing resources and mentoring You provide project ideas Information session: Monday 3/19 5pm in Gates 415 (there will be pizza) 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

What Next? Courses Other relevant courses CS224W: Social and Information Network Analysis CS276: Information Retrieval and Web Search CS229: Machine Learning CS245: Database System Principles CS347: Transaction Processing and Distributed Databases CS448g: Interactive Data Analysis 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Thank You for the Hard Work!!! In Closing You Have Done a Lot!!! And (hopefully) learned a lot!!! Answered questions and proved many interesting results Implemented a number of methods And did excellently on the final! Thank You for the Hard Work!!! 9/22/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu