Scalable Machine Learning


Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

But What Is Machine Learning?
“Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin)
Given a data set X, can we effectively predict Y by optimizing Z?

Supervised vs. Unsupervised
Supervised: algorithms trained on labeled examples
  I know these images are of cats and these are of dogs; tell me if this image is a cat or a dog
Unsupervised: algorithms trained on unlabeled examples
  Group these images together by similarity, i.e. using some kind of distance function
There are a number of approaches to machine learning, but these two are commonly used.

Use Cases
Collaborative Filtering: Takes users' behavior and from that tries to find items users might like
Clustering: Takes things and puts them into groups of related things
Classification: Learns from existing categories to determine what things in a category look like, and assigns unlabeled things the (hopefully) correct category
Frequent Itemset Mining: Analyzes items in a group and identifies which items frequently appear together

Clustering
Dirichlet Process Clustering: Bayesian mixture modeling
K-Means Clustering: Partition n observations into k clusters
Fuzzy K-Means: Soft clusters where a point can be in more than one cluster
Hierarchical Clustering: Hierarchy of clusters built bottom-up or top-down
Canopy Clustering: Preprocess data before K-Means or Hierarchical clustering

More Clustering
Latent Dirichlet Allocation: Cluster words into topics and documents into mixtures of topics
Mean Shift Clustering: Finding modes or clusters in 2-dimensional space, where the number of clusters is unknown
Minhash Clustering: Quickly estimate similarity between two data sets
Spectral Clustering: Cluster points using eigenvectors of matrices derived from the data

Collaborative Filtering
Distributed Item-based Collaborative Filtering: Estimates a user's preference for one item by looking at the user's preferences for similar items
Collaborative Filtering using a Parallel Matrix Factorization: Among a matrix of items that a user has not yet seen, predict which items the user might prefer

Classification
Bayesian: Classify objects into binary categories
Random Forests: Method for classification and regression by constructing a multitude of decision trees
[Figure: example classes "Dog" and "Cat"]

Frequent Itemset Mining
Parallel FP (Frequent Pattern) Growth Algorithm: Analyzes items in a group and then identifies which items appear together

Algorithm Examples
K-Means Clustering using Mahout
Alternating Least Squares (Recommender) using Spark MLlib
Final question: what is the algorithm for k-means?

Apache Mahout ma·hout -\mə-ˈhau̇t\ - noun - A keeper and driver of an elephant

Overview
Goal: build a scalable machine learning library, in both data volume and processing
Began in 2008 as a subproject of Apache Lucene, then became a top-level Apache project in 2010
No longer accepting Java MapReduce implementations in favor of Spark MLlib
Addresses issues commonly found in ML libraries:
  Lack of community, scalability, documentation/examples, Apache licensing
  Not well-tested
  Not research oriented
  Not built on existing production-quality projects
An open-source, cross-platform Apache library for machine learning in Java
Scalable, but not as simple as adding more nodes to the cluster: it depends on the algorithm you are using, the type of data, the chosen feature vectors, and other things that can impact how well Mahout will scale
Active Community: an active community of smart people who like to spend their free time talking about machine learning use cases

Technical Requirements
Linux
Java 1.6 or greater
Maven
Hadoop (although not all algorithms are implemented to work on Hadoop clusters)

Building Mahout for Hadoop 2
Check out Mahout trunk with git
  git clone https://github.com/apache/mahout.git
Build with Maven, giving it the proper Hadoop and HBase versions
  cd mahout
  mvn install -DskipTests \
      -Dhadoop2 -Dhadoop2.version=2.6.0 \
      -Dhbase.version=1.0.0
  cd ../
  mv mahout /usr/share/491s15
  # Edit .bashrc/.bash_profile to add a $MAHOUT_HOME variable,
  # $MAHOUT_HOME/bin to the path, and
  # export HADOOP_CONF_DIR=/usr/share/491s15/hadoop/etc/hadoop

K-Means Clustering
Step 1: Start with a bunch of points
Step 2: Randomly generate centroids (c1, c2, c3) among the data; implementations will often choose existing random points
Step 3: Using your distance measure, locate the closest centroid for each data point
Step 4: Find the new center of each group, and move the centroids there
Step 5: Iterate through this process over and over again until convergence
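The steps above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Mahout's implementation: the function names, the Euclidean distance, the fixed random seed, and the "largest centroid movement is at most convergence_delta" stopping rule are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iterations=10, convergence_delta=1e-4, seed=0):
    """Illustrative k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Pick k existing points at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Assumed stopping rule: no centroid moved more than the delta.
        moved = max(euclidean(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return centroids, clusters
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns the final centroids together with the points assigned to each one. The distance function is the part most worth swapping: for text vectors a cosine-based distance is a common alternative to Euclidean distance.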

K-Means Clustering Example
Let's cluster the Reuters data set
A bunch (21,578 to be exact) of hand-classified news articles from the greatest year created, 1987
Steps!
  Generate Sequence Files from data
  Generate Vectors from Sequence Files
  Run k-means

K-Means Clustering
Convert dataset into a Sequence File
Download and extract the SGML files
  $ wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
  $ mkdir reuters-sgm
  $ tar -xf reuters21578.tar.gz -C reuters-sgm/
Extract content from SGML to text files
  $ mahout org.apache.lucene.benchmark.utils.ExtractReuters \
      reuters-sgm/ reuters-out/
  $ hdfs dfs -put reuters-out .
  # Takes a while...
Use the seqdirectory tool to convert the text files into a Hadoop Sequence File
  $ mahout seqdirectory -i reuters-out \
      -o reuters-out-seqdir -c UTF-8 -chunk 5

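Tangent: Writing to Sequence Files
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Say you have some documents array
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("testdata/part-00000");
  // Key and value are both Text: document ID -> document content
  SequenceFile.Writer writer =
      new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
  for (int i = 0; i < MAX_DOCS; ++i) {
    writer.append(new Text(documents[i].getId()),
                  new Text(documents[i].getContent()));
  }
  writer.close();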

Original File
  $ cat reut2-000.sgm-30.txt
  26-FEB-1987 15:43:14.36
  U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

Now, in Sequence File
  Key:    /reut2-000.sgm-30.txt
  Value*: 26-FEB-1987 15:43:14.36 U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991
  * Contains new line characters

K-Means Clustering
Generate Vectors from Sequence Files
Steps
  Compute dictionary
  Assign integers for words
  Compute feature weights
  Create a vector for each document using the word-integer mapping and feature weights
Or simply run mahout seq2sparse
  $ mahout seq2sparse \
      -i reuters-out-seqdir/ \
      -o reuters-out-seqdir-sparse-kmeans

Document to Integers to Vector
Document:
  26-FEB-1987 15:43:14.36
  U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991
Dictionary (word -> integer):
  14.36    2737
  15       2962
  1991     3960
  26       5405
  43       8361
  6.7      10882
  billion  15528
  curbs    19078
  dlrs     20362
  estate   21578
  feb      22224
  raising  33629
  seek     35909
  tax      38507
  u.s      39687
  writers  41511
Vector:
  { 3960:1.0, 21578:1.0, 33629:1.0, 41511:1.0, 8361:1.0, 10882:1.0, 5405:1.0, 22224:1.0, 15528:1.0, 38507:2.0, 39687:1.0, 2737:1.0, 35909:1.0, 2962:1.0, 19078:1.0, 20362:1.0 }
One document of many!
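The word-to-integer mapping above can be illustrated with a short Python sketch. The dictionary values are copied from the slide; the token list is an assumption about how an analyzer might split this document (it is not the output of Mahout's actual tokenizer), and the function name is made up for the example. Weights here are raw term counts only.

```python
from collections import Counter

# Word -> integer dictionary, copied from the slide.
DICTIONARY = {
    "14.36": 2737, "15": 2962, "1991": 3960, "26": 5405, "43": 8361,
    "6.7": 10882, "billion": 15528, "curbs": 19078, "dlrs": 20362,
    "estate": 21578, "feb": 22224, "raising": 33629, "seek": 35909,
    "tax": 38507, "u.s": 39687, "writers": 41511,
}

# Hypothetical tokenization of the example document ("tax" occurs twice).
TOKENS = [
    "26", "feb", "15", "43", "14.36", "u.s", "tax", "writers", "seek",
    "estate", "tax", "curbs", "raising", "6.7", "billion", "dlrs", "1991",
]


def to_sparse_vector(tokens, dictionary):
    """Map each known token to its integer ID and count its occurrences."""
    counts = Counter(tokens)
    return {dictionary[t]: float(n) for t, n in counts.items() if t in dictionary}


vector = to_sparse_vector(TOKENS, DICTIONARY)
```

Only the entry for "tax" (38507) gets a weight of 2.0, matching the vector on the slide; every other term appears once. Tokens missing from the dictionary are silently skipped in this sketch.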

After seq2sparse
  Key:   /reut2-000.sgm-30.txt
  Value: {3960:1.0,21578:1.0,33629:1.0,41511:1.0,8361:1.0,10882:1.0,5405:1.0,22224:1.0,15528:1.0,38507:2.0,39687:1.0,2737:1.0,35909:1.0,2962:1.0,19078:1.0,20362:1.0}
  The values are the “Feature Weights”

K-Means Clustering
Run the kmeans program
  $ mahout kmeans \
      -i reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
      -c reuters-kmeans-clusters \
      -o reuters-kmeans \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -cd 0.1 -x 10 -k 20
Key Parameters
  -i:  Input vectors
  -c:  Input clusters
  -o:  Output working directory
  -dm: Distance measure
  -cd: Convergence delta
  -x:  Maximum number of iterations
  -k:  Number of initial clusters to sample from input vectors

Inspect clusters
  $ bin/mahout clusterdump \
      -i reuters-kmeans/clusters-*-final \
      -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
      -dt sequencefile -b 100 -n 10
  :{"identifier":"VL-316","r":[{"00":0.497},{"00.14":0.408},{"00.18":0.408},{"00.56
  Top Terms:
    president => 3.4944214993103375
    chief => 3.3234287659025012
    executive => 3.16472187060367
    officer => 3.143776322498974
    chairman => 2.5400053276308587
    vice => 1.9913627557428164
    named => 1.9851312619198411
    said => 1.9030630459350324
    company => 1.782354193948521
    names => 1.4052995438811444

FAQs
How to get rid of useless words?
  Increase minSupport and/or decrease dfPercent
  Use StopwordsAnalyzer
How to see document-to-cluster assignments?
  Run the clustering process at the end of centroid generation using -cl
How to choose appropriate weighting?
  If it's long text, go with tf-idf (term frequency–inverse document frequency)
  Use normalization if documents differ in length
How to run this on a cluster?
  Set HADOOP_CONF_DIR to point to your Hadoop cluster conf directory
How to scale?
  Use a small value of k to partially cluster the data, and then do full clustering on each cluster

FAQs
How to choose k?
  Figure it out based on the data you have: trial and error
  Or use Canopy Clustering and a distance threshold to figure it out
  Or use Spectral clustering
How to improve similarity measurement?
  Not all features are equal
  A small weight difference for certain types creates a large semantic difference
  Use WeightedDistanceMeasure
  Or write a custom DistanceMeasure

Recommendations Help users find items they might like based on historical preferences Based on example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”

Recommendations
         The Matrix  Alien  Inception
  Alice  5           1      4
  Bob    ?           2      5
  Peter  4           3      2

Recommendations
Algorithm: neighborhood-based approach
Works by finding similarly rated items in the user-item matrix, using a similarity measure (e.g. cosine, Pearson correlation, Tanimoto coefficient)
Estimates a user's preference towards an item by looking at his/her preferences towards similar items

Recommendations
Prediction: Estimate Bob's preference towards “The Matrix”
Look at all items that
  a) are similar to “The Matrix”
  b) have been rated by Bob
  => “Alien”, “Inception”
Estimate the unknown preference with a weighted sum

Recommendations
MapReduce phase 1
Map: Make user the key
Input:
  (Alice, Matrix, 5)
  (Alice, Alien, 1)
  (Alice, Inception, 4)
  (Bob, Alien, 2)
  (Bob, Inception, 5)
  (Peter, Matrix, 4)
  (Peter, Alien, 3)
  (Peter, Inception, 2)
Output:
  Alice  (Matrix, 5)
  Alice  (Alien, 1)
  Alice  (Inception, 4)
  Bob    (Alien, 2)
  Bob    (Inception, 5)
  Peter  (Matrix, 4)
  Peter  (Alien, 3)
  Peter  (Inception, 2)

Recommendations
MapReduce phase 1
Reduce: Create inverted index
Input:
  Alice  (Matrix, 5)
  Alice  (Alien, 1)
  Alice  (Inception, 4)
  Bob    (Alien, 2)
  Bob    (Inception, 5)
  Peter  (Matrix, 4)
  Peter  (Alien, 3)
  Peter  (Inception, 2)
Output:
  Alice  (Matrix, 5) (Alien, 1) (Inception, 4)
  Bob    (Alien, 2) (Inception, 5)
  Peter  (Matrix, 4) (Alien, 3) (Inception, 2)

Recommendations
MapReduce phase 2
Map: Isolate all co-occurred ratings (all cases where a user rated both items)
Input:
  Alice  (Matrix, 5) (Alien, 1) (Inception, 4)
  Bob    (Alien, 2) (Inception, 5)
  Peter  (Matrix, 4) (Alien, 3) (Inception, 2)
Output:
  Matrix, Alien      (5,1)
  Matrix, Alien      (4,3)
  Alien, Inception   (1,4)
  Alien, Inception   (2,5)
  Alien, Inception   (3,2)
  Matrix, Inception  (4,2)
  Matrix, Inception  (5,4)
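The first phase and the phase 2 map step can be simulated in memory with plain Python, which may help when tracing the example by hand. This is only a sketch of the data flow, not Mahout's job code: the function names and the ITEM_IDS ordering (used to give every item pair a canonical key) are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# (user, item, rating) triples from the example.
RATINGS = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

# Hypothetical item IDs, used only to order the two items of a pair consistently.
ITEM_IDS = {"Matrix": 0, "Alien": 1, "Inception": 2}


def phase1(ratings):
    """Map: key each rating by user. Reduce: collect each user's (item, rating) list."""
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
    return dict(by_user)


def phase2_map(by_user):
    """Emit the co-occurred ratings for every pair of items rated by the same user."""
    pairs = defaultdict(list)
    for item_ratings in by_user.values():
        ordered = sorted(item_ratings, key=lambda ir: ITEM_IDS[ir[0]])
        for (item_a, r_a), (item_b, r_b) in combinations(ordered, 2):
            pairs[(item_a, item_b)].append((r_a, r_b))
    return dict(pairs)


co_ratings = phase2_map(phase1(RATINGS))
```

Each key of `co_ratings` is an item pair and each value is the list of rating pairs from users who rated both items, which is the input the phase 2 reducer needs to compute a similarity per item pair.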

Recommendations
MapReduce phase 2
Reduce: Compute similarities
Input:
  Matrix, Alien      (5,1)
  Matrix, Alien      (4,3)
  Alien, Inception   (1,4)
  Alien, Inception   (2,5)
  Alien, Inception   (3,2)
  Matrix, Inception  (4,2)
  Matrix, Inception  (5,4)
Output:
  Matrix, Alien      (-0.47)
  Matrix, Inception  (0.47)
  Alien, Inception   (-0.63)

Recommendations
Calculate weighted sum
  (-0.47*2 + 0.47*5) / (0.47 + 0.47) = 1.5
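The arithmetic above follows the pattern "sum of similarity times rating, divided by the sum of absolute similarities". A minimal Python sketch of that calculation, using the similarity values from the example (the function and variable names are made up for illustration):

```python
def predict_rating(similarities, user_ratings):
    """Weighted sum of a user's ratings, weighted by similarity to the target item.

    similarities: item -> similarity between that item and the target item
    user_ratings: item -> rating the user gave that item
    """
    rated = [item for item in user_ratings if item in similarities]
    numerator = sum(similarities[item] * user_ratings[item] for item in rated)
    denominator = sum(abs(similarities[item]) for item in rated)
    if denominator == 0:
        return None  # no similar rated items, so no estimate is possible
    return numerator / denominator


# Similarities to "The Matrix" and Bob's known ratings, from the example.
similarity_to_matrix = {"Alien": -0.47, "Inception": 0.47}
bob_ratings = {"Alien": 2, "Inception": 5}

prediction = predict_rating(similarity_to_matrix, bob_ratings)
```

For Bob this reproduces the estimate of about 1.5 for "The Matrix". Dividing by the absolute values keeps the denominator positive even when some similarities are negative, as they are here.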

Recommendations
         The Matrix  Alien  Inception
  Alice  5           1      4
  Bob    1.5         2      5
  Peter  4           3      2

Implementation in Spark
Alternating Least Squares (ALS)
Accepts a tuple of (user, product, rating) to train data
Accepts a tuple of (user, product) to predict their rating
Example: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

Implementations in Mahout
ItemSimilarityJob
Computes all item similarities
Various configuration options:
  Similarity measure to use (cosine, Pearson correlation, etc.)
  Maximum number of similar items per item
  Maximum number of co-occurrences to consider
Input: CSV file (userID, itemID, value)
Output: Pairs of itemIDs with associated similarity

Implementations in Mahout
RecommenderJob
Distributed item-based recommender
Various configuration options:
  Similarity measure to use
  Number of recommendations per user
  Filter out some users or items
Input: CSV file (userID, itemID, value)
Output: userIDs with recommended itemIDs and their scores

References
http://mahout.apache.org
http://spark.apache.org
http://isabel-drost.de/hadoop/slides/collabMahout.pdf
http://www.slideshare.net/OReillyOSCON/hands-on-mahout#
http://www.slideshare.net/urilavi/intro-to-mahout