Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland
About Us Ted Dunning: Chief Application Architect at MapR; committer and PMC member at Apache Mahout; previously MusicMatch (Yahoo! Music), Veoh recommendation, ID Analytics. Robin Anil: Software Engineer at Google; committer and PMC member at Apache Mahout; previously Yahoo! (display ads), Minekey recommendation.
Agenda Intro to Mahout (5 mins) Overview of Algorithms in Mahout (10 mins) Hands on Mahout! -Clustering (30 mins) -Classification (30 mins) -Advanced topics with Q&A (15 mins)
Mission To build a scalable machine learning library
Scale! Scale to large datasets - Hadoop MapReduce implementations that scale linearly with data - Fast sequential algorithms whose runtime doesn't depend on the size of the data - Goal: to be as fast as possible for any algorithm Scalable to support your business case - Apache Software License 2 Scalable community - Vibrant, responsive and diverse - Come to the mailing list and find out more
Current state of ML libraries Lack community Lack scalability Lack documentation and examples Lack Apache licensing Are not well tested Are research-oriented Not built on existing production-quality libraries Lack deployability
Algorithms and Applications
Clustering Call it fuzzy grouping based on a notion of similarity
Mahout Clustering Plenty of Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet Group similar looking objects Notion of similarity: Distance measure: -Euclidean -Cosine -Tanimoto -Manhattan
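Each of these distance measures is a short formula over two vectors. A plain-Java sketch of the four listed above (illustrative only, not Mahout's DistanceMeasure implementations):

```java
// Illustrative implementations of the distance measures named on the slide.
// These mirror the formulas, not Mahout's DistanceMeasure interface.
public class Distances {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
    // Euclidean: straight-line distance
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
    // Cosine distance: 1 - cos(angle between a and b); ignores vector length
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }
    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }
    // Manhattan: sum of absolute coordinate differences
    static double manhattan(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }
}
```

Cosine ignores vector length, which is often why it is preferred for documents of differing size; Euclidean and Manhattan do not.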
Classification Predicting the type of a new object based on its features The types are predetermined Dog Cat
Mahout Classification Plenty of algorithms - Naïve Bayes - Complementary Naïve Bayes - Random Forests - Logistic Regression (SGD) - Support Vector Machines (patch ready) Learn a model from manually classified data Predict the class of a new object based on its features and the learned model
Part 1 - Clustering
Understanding data - Vectors X = 5, Y = 3 The vector denoted by the point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
Representing Vectors – The basics Now think 3, 4, 5, ….. n-dimensional Think of a document as a bag of words. she sells sea shells on the sea shore Now map them to integers she => 0 sells => 1 sea => 2 and so on The resulting vector [1.0, 1.0, 2.0, … ]
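The mapping above can be sketched end to end: build the word-to-integer dictionary, then count occurrences per dimension (BagOfWords is a hypothetical helper class, not part of Mahout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Build a dictionary (word -> index) and a term-frequency vector
// for the example sentence from the slide.
public class BagOfWords {
    static Map<String, Integer> dictionary(String text) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String w : text.split("\\s+")) {
            dict.putIfAbsent(w, dict.size());  // she => 0, sells => 1, sea => 2, ...
        }
        return dict;
    }
    static double[] vectorize(String text, Map<String, Integer> dict) {
        double[] v = new double[dict.size()];
        for (String w : text.split("\\s+")) {
            v[dict.get(w)] += 1.0;             // count occurrences per dimension
        }
        return v;
    }
}
```

For "she sells sea shells on the sea shore" this yields seven dimensions and the vector [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0] — "sea" appears twice, hence the 2.0.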
Vectors Imagine one dimension for each word. Each dimension is also called a feature Two techniques -Dictionary Based -Randomizer Based
Clustering Reuters dataset
Step 1 – Convert dataset into a Hadoop Sequence File http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz Download (8.2 MB) and extract the SGML files. -$ mkdir -p mahout-work/reuters-sgm -$ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd .. Extract content from SGML to text files -$ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out
Step 1 – Convert dataset into a Hadoop Sequence File Use seqdirectory tool to convert text file into a Hadoop Sequence File -$ bin/mahout seqdirectory \ -i mahout-work/reuters-out \ -o mahout-work/reuters-out-seqdir \ -c UTF-8 -chunk 5
Hadoop Sequence File Sequence of records, where each record is a <Key, Value> pair - <Key1, Value1> - <Key2, Value2> - … Key and Value need to be of class org.apache.hadoop.io.Text - Key = record name, file name or unique identifier - Value = content as a UTF-8 encoded string TIP: Dump data from your database directly into Hadoop Sequence Files (see next slide)
Writing to Sequence Files Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path path = new Path("testdata/part-00000"); SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, Text.class); for (int i = 0; i < MAX_DOCS; i++) { writer.append(new Text(documents(i).Id()), new Text(documents(i).Content())); } writer.close();
Generate Vectors from Sequence Files Steps 1. Compute dictionary 2. Assign integers for words 3. Compute feature weights 4. Create a vector for each document using the word-integer mapping and feature weights Or simply run $ bin/mahout seq2sparse
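Step 3's feature weights are typically TF-IDF. A deliberately simplified sketch of the idea (seq2sparse's exact formula and smoothing differ; TfIdf is a hypothetical helper):

```java
// Simplified TF-IDF weighting: a term's weight grows with how often it
// appears in the document (tf) and shrinks with how many documents
// contain it (df). Illustrative only, not seq2sparse's exact formula.
public class TfIdf {
    // idf = log(N / df): rare terms get high weight, ubiquitous terms low weight
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }
    static double tfidf(int termFreq, int numDocs, int docFreq) {
        return termFreq * idf(numDocs, docFreq);
    }
}
```

A term appearing in every document gets weight zero — which is exactly why TF-IDF suppresses words like "said" that dominate raw term counts.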
Generate Vectors from Sequence Files $ bin/mahout seq2sparse \ -i mahout-work/reuters-out-seqdir/ \ -o mahout-work/reuters-out-seqdir-sparse-kmeans Important options - Ngrams - Lucene Analyzer for tokenizing - Feature pruning - Min support - Max document frequency - Min LLR (for ngrams) - Weighting method - TF vs. TFIDF - lp-Norm - Log normalize length
Start K-Means clustering $ bin/mahout kmeans \ -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \ -c mahout-work/reuters-kmeans-clusters \ -o mahout-work/reuters-kmeans \ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 \ -x 10 -k 20 -ow Things to watch out for - Number of iterations - Convergence delta - Distance measure - Creating assignments
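What the kmeans job does on each iteration can be sketched in plain Java: assign every point to its nearest centroid, recompute the centroids, and stop after the maximum number of iterations (-x) or once no centroid moves more than the convergence delta (-cd). A toy in-memory version using Euclidean distance (KMeans is a hypothetical helper, not Mahout's driver):

```java
// Minimal Lloyd's k-means over double[] points; illustrates the
// -x (max iterations) and -cd (convergence delta) options from the slide.
public class KMeans {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
    static double[][] cluster(double[][] points, double[][] centroids, int maxIter, double cd) {
        int k = centroids.length, dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                  // assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            double moved = 0;
            for (int c = 0; c < k; c++) {                // update step
                if (counts[c] == 0) continue;            // keep empty clusters in place
                double[] next = new double[dim];
                for (int d = 0; d < dim; d++) next[d] = sums[c][d] / counts[c];
                moved = Math.max(moved, dist(next, centroids[c]));
                centroids[c] = next;
            }
            if (moved < cd) break;                       // converged: all moves below delta
        }
        return centroids;
    }
}
```

A small convergence delta means more iterations before stopping; the choice of distance measure changes which centroid "nearest" picks, which is why it is listed among the things to watch.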
K-Means clustering (diagram across four slides: three centroids c1, c2, c3 are seeded, points are assigned to the nearest centroid, and the centroids move as they are recomputed over successive iterations)
Inspect clusters $ bin/mahout clusterdump \ -s mahout-work/reuters-kmeans/clusters-9 \ -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \ -dt sequencefile -b 100 -n 20 Typical output: VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, … Top terms: iran => 3.1861672217321213 strike => 2.567886952727918 iranian => 2.133417966282966 union => 2.116033937940266 said => 2.101773806290277 workers => 2.066259451354332 gulf => 1.9501374918521601 had => 1.6077752463145605 he => 1.5355078004962228
FAQs How to get rid of useless words How to see document-to-cluster assignments How to choose appropriate weighting How to run this on a cluster How to scale How to choose k How to improve similarity measurement
FAQs How to get rid of useless words - Increase minSupport and/or decrease dfPercent - Use StopwordsAnalyzer How to see document-to-cluster assignments - Run the clustering process at the end of centroid generation using -cl How to choose appropriate weighting - If it's long text, go with TFIDF; use normalization if documents differ in length How to run this on a cluster - Set the HADOOP_CONF directory to point to your Hadoop cluster conf directory How to scale - Use a small value of k to partially cluster data, then do full clustering on each cluster
FAQs How to choose k - Figure it out based on the data you have: trial and error - Or use Canopy clustering and a distance threshold to figure it out - Or use spectral clustering How to improve similarity measurement - Not all features are equal - A small weight difference for certain types creates a large semantic difference - Use WeightedDistanceMeasure - Or write a custom DistanceMeasure
Interesting problems Find users talking about OSCON11 and cluster them based on what they are tweeting - Can you suggest people to network with? Take the tags users have given for musicians and cluster them - Use the clusters to pre-populate the suggest box and autocomplete tags as users type Cluster movies based on abstract and description and show related movies - Note how this can augment recommendations or collaborative filtering algorithms
More clustering algorithms Canopy Fuzzy K-Means Mean Shift Dirichlet process clustering Spectral clustering.
Part 2 - Classification
Preliminaries Code is available from github: - git@github.com:tdunning/Chapter-16.git EC2 instances available Thumb drives also available Email ted.dunning@gmail.com Twitter @ted_dunning
A Quick Review What is classification? -goes-ins: predictors -goes-outs: target variable What is classifiable data? -continuous, categorical, word-like, text-like -uniform schema How do we convert from classifiable data to feature vector?
Data Flow Not quite so simple
Classifiable Data Continuous -A number that represents a quantity, not an id -Blood pressure, stock price, latitude, mass Categorical -One of a known, small set (color, shape) Word-like -One of a possibly unknown, possibly large set Text-like -Many word-like things, usually unordered
But that isn't quite there Learning algorithms need feature vectors - Have to convert from data to vector Can assign one location per feature - or category - or word Can assign one or more locations with hashing - scary - but safe on average
Data Flow
The pipeline: Classifiable Data → Vectors
Instance and Target Variable
Hashed Encoding
What about collisions?
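A minimal sketch of hashed encoding, and why collisions are "safe on average": each word hashes to a fixed slot, colliding words share a slot, and with a large enough vector the expected distortion is small. (HashedEncoder is a hypothetical helper; Mahout ships real encoder classes for this.)

```java
// Hashing-trick encoding: no dictionary needed, fixed vector size,
// occasional collisions. Illustrative, not Mahout's encoder classes.
public class HashedEncoder {
    static double[] encode(String[] words, int dim) {
        double[] v = new double[dim];
        for (String w : words) {
            int idx = Math.floorMod(w.hashCode(), dim);  // word -> fixed slot
            v[idx] += 1.0;                               // collisions simply add
        }
        return v;
    }
}
```

Because a collision just merges two rarely co-occurring features, the total weight is preserved and the damage to a linear model is bounded; using several probes per word (as the slide hints with "one or more locations") reduces it further.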
Let's write some code (cue relaxing background music)
Generating new features Sometimes the existing features are difficult to use Restating the geometry using new reference points may help Automatic reference points found using k-means can be better than manual references
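Concretely, once k-means has produced centroids, each point can be re-described by its distance to every centroid — k new features per point (CentroidFeatures is a hypothetical helper):

```java
// Turn a point into k new features: its distance to each k-means centroid.
// This restates the geometry relative to automatically chosen reference points.
public class CentroidFeatures {
    static double[] features(double[] point, double[][] centroids) {
        double[] f = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            double s = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                s += diff * diff;
            }
            f[c] = Math.sqrt(s);   // Euclidean distance to centroid c
        }
        return f;
    }
}
```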
K-means using target
K-means features
More code! (cue relaxing background music)
Integration Issues Feature extraction is ideal for map-reduce - Side data adds some complexity Clustering works great with map-reduce - Cluster centroids to HDFS Model training works better sequentially - Need centroids in normal files Model deployment shouldn't depend on HDFS
Parallel Stochastic Gradient Descent (diagram: input splits → train sub-models → average models → model)
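The averaging step in the diagram above is just an element-wise mean over the sub-models' weight vectors (ModelAverager is a hypothetical helper, not Mahout's SGD classes):

```java
// Average several sub-models' weight vectors into one model, as in the
// parallel SGD picture: train sub-models on splits independently, then average.
public class ModelAverager {
    static double[] average(double[][] models) {
        double[] avg = new double[models[0].length];
        for (double[] m : models)
            for (int i = 0; i < m.length; i++)
                avg[i] += m[i] / models.length;   // equal weight per sub-model
        return avg;
    }
}
```

For convex losses like logistic regression, averaging independently trained linear models keeps most of the accuracy while letting the expensive training step run in parallel.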
Variational Dirichlet Assignment (diagram: input splits → gather sufficient statistics → update model → model)
Old tricks, new dogs Mapper - Assign point to cluster - Emit cluster id, (1, point) Combiner and reducer - Sum counts, weighted sum of points - Emit cluster id, (n, sum/n) Output to HDFS (diagram callouts: centroids are written by map-reduce and read from HDFS to local disk via the distributed cache)
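Emitting (n, sum/n) works because two partial summaries can be merged into an exact combined mean — a weighted average. A sketch of that merge (PartialMean is a hypothetical helper):

```java
// Merge two partial cluster summaries (count, mean) into one, as the
// combiner/reducer on the slide does with its (n, sum/n) records.
public class PartialMean {
    static double[] merge(long n1, double mean1, long n2, double mean2) {
        long n = n1 + n2;
        double mean = (n1 * mean1 + n2 * mean2) / n;  // weighted mean is exact
        return new double[]{n, mean};
    }
}
```

This is what makes the combiner safe: partial means from different mappers can be combined in any order without changing the final centroid.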
Old tricks, new dogs Mapper - Assign point to cluster - Emit cluster id, (1, point) Combiner and reducer - Sum counts, weighted sum of points - Emit cluster id, (n, sum/n) Output to MapR FS (diagram callouts: centroids are written by map-reduce and read back via NFS)
Modeling architecture (diagram: input plus side-data → feature extraction and down sampling → data join → sequential SGD learning; the map-reduce stage now feeds the sequential stage via NFS)
More in Mahout
Topic modeling Grouping similar or co-occurring features into a topic -Topic Lol Cat: -Cat -Meow -Purr -Haz -Cheeseburger -Lol
Mahout Topic Modeling Algorithm: Latent Dirichlet Allocation - Input: a set of documents - Output: the top K prominent topics and the features in each topic
Recommendations Predict what a user likes based on - his/her historical behavior - the aggregate behavior of people similar to him/her
Mahout Recommenders Different types of recommenders - User based - Item based Full framework for storage, and online and offline computation of recommendations Like clustering, there is a notion of similarity between users or items - Cosine, Tanimoto, Pearson and LLR
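The similarity notion carries over directly: an item-based recommender compares two items' rating vectors, one entry per user. A toy cosine sketch (ItemCosine is a hypothetical helper, not Mahout's Taste classes):

```java
// Cosine similarity between two items' rating vectors (one entry per user),
// the building block of the item-based recommenders mentioned above.
public class ItemCosine {
    static double similarity(double[] itemA, double[] itemB) {
        double dot = 0, na = 0, nb = 0;
        for (int u = 0; u < itemA.length; u++) {
            dot += itemA[u] * itemB[u];
            na += itemA[u] * itemA[u];
            nb += itemB[u] * itemB[u];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));  // 1 = same taste profile
    }
}
```

An item scores high for a user when it is similar, in this sense, to items the user has already rated well; swapping in Tanimoto, Pearson or LLR changes only this similarity function.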
Frequent Pattern Mining Find interesting groups of items based on how they co-occur in a dataset
Mahout Parallel FPGrowth Identify the most commonly occurring patterns from - Sales transactions: milk, eggs and bread - Query logs: ipad -> apple, tablet, iphone - Spam detection: Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
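The brute-force version of this is just counting co-occurring item pairs across transactions; FPGrowth reaches the same kind of answer at scale without enumerating every pair (PairCounter is a hypothetical helper):

```java
import java.util.HashMap;
import java.util.Map;

// Count how often item pairs co-occur across transactions; pairs with
// high counts are the "interesting groups" FPGrowth finds efficiently.
public class PairCounter {
    static Map<String, Integer> countPairs(String[][] transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tx : transactions)
            for (int i = 0; i < tx.length; i++)
                for (int j = i + 1; j < tx.length; j++)
                    counts.merge(tx[i] + "," + tx[j], 1, Integer::sum);
        return counts;
    }
}
```

Pair counting explodes combinatorially on wide transactions, which is precisely the problem FPGrowth's compressed prefix tree avoids.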
Get Started http://mahout.apache.org dev@mahout.apache.org - Developer mailing list user@mahout.apache.org - User mailing list Check out the documentation and wiki for a quickstart Browse code: http://svn.apache.org/repos/asf/mahout/trunk/ Send us email! - tdunning@maprtech.com - tdunning@apache.org - robinanil@apache.org Try out MapR! - www.mapr.com
Resources Mahout in Action - Owen, Anil, Dunning, Friedman - http://www.manning.com/owen Taming Text - Ingersoll, Morton, Farris - http://www.manning.com/ingersoll Introducing Apache Mahout - http://www.ibm.com/developerworks/java/library/j-mahout/
Thanks to the Apache Foundation, the Mahout committers, the Google Summer of Code organizers and students, OSCON, and open source!
References news.google.com Cat: http://www.flickr.com/photos/gattou/3178745634/ Dog: http://www.flickr.com/photos/30800139@N04/3879737638/ Milk, eggs, bread: http://www.flickr.com/photos/nauright/4792775946/ Amazon recommendations twitter