Industrial Strength Machine Learning Jeff Eastman

Presentation transcript:

Apache Mahout: Industrial Strength Machine Learning
Jeff Eastman

Current Situation
- Large volumes of data are now available
- Platforms now exist to run computations over large datasets (Hadoop, HBase)
- Sophisticated analytics are needed to turn data into information people can use
- Active research community and proprietary implementations of “machine learning” algorithms
- The world needs scalable implementations of ML under an open license (ASF)

Where is ML Used Today
- Internet search clustering
- Knowledge management systems
- Social network mapping
- Taxonomy transformations
- Marketing analytics
- Recommendation systems
- Log analysis & event filtering
- SPAM filtering, fraud detection

History of Mahout
- Summer 2007
  - Developers needed scalable ML
  - Mailing list formed
- Community formed
  - Apache contributors
  - Academia & industry
  - Lots of initial interest
- Project formed under Apache Lucene
  - January 25, 2008

Who We Are (so far)
- Grant Ingersoll
- Dawid Weiss
- Ozgur Yilmazel
- Erik Hatcher
- Karl Wettin
- Jeff Eastman
- Ted Dunning
- Sean Owen
- Otis Gospodnetic
- Isabel Drost

Current Code Base
- Matrix & Vector library
  - Hama collaboration for very large arrays
- Clustering
  - Canopy
  - K-Means
  - Mean Shift
- Utilities
  - Distance Measures
  - Parameters

Example: K-Means
- Given K, assign the first K random points to be the initial cluster centers
- Assign subsequent points to the closest cluster using the supplied distance measure
- Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta
- Run a final pass over the points to cluster them for output
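The steps on this slide can be sketched in a few lines of single-machine Python. This is only an illustration under stated assumptions, not Mahout's Java code: the function names, the Euclidean default for the distance measure, the `max_iterations` cap, and the choice to read "first K random points" as "the first K points of an already shuffled input" are all assumptions made here.

```python
import math

def euclidean(a, b):
    """One possible distance measure (assumed default; any callable works)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centers, distance):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, distance=euclidean, delta=1e-3, max_iterations=100):
    # Step 1: the first k points become the initial centers
    # (assumes the input order is already random).
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Step 2: assign every point to its closest center.
        clusters = assign(points, centers, distance)
        # Step 3: recompute centroids; an empty cluster keeps its old center.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        converged = all(distance(old, new) <= delta
                        for old, new in zip(centers, new_centers))
        centers = new_centers
        if converged:
            break
    # Step 4: final pass that clusters the points for output.
    return centers, assign(points, centers, distance)
```

Calling `k_means(points, 2)` on a list of coordinate tuples returns the final centers and the points grouped by cluster index.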

K-Means Map/Reduce Design
- Driver
  - Runs multiple iteration jobs using mapper+combiner+reducer
  - Runs final clustering job using only mapper
- Mapper
  - Configure: Single file containing encoded Clusters
  - Input: File split containing encoded Vectors
  - Output: Vectors keyed by nearest cluster
- Combiner
  - Input: Vectors keyed by nearest cluster
  - Output: Cluster centroid vectors keyed by “cluster”
- Reducer (singleton)
  - Input: Cluster centroid vectors
  - Output: Single file containing Vectors keyed by cluster
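The data flow of one iteration job in this design can be imitated in memory with plain Python. This is a rough sketch, not Hadoop or Mahout code: the function names are hypothetical, the input splits are ordinary lists, and the combiner here emits a partial sum together with a point count (an assumption added so that the single reducer can weight each mapper's contribution correctly).

```python
def squared_distance(a, b):
    """Squared Euclidean distance, enough for nearest-center comparisons."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_split(split, centers):
    """Mapper: emit (nearest cluster id, vector) for every vector in a split."""
    for vector in split:
        cid = min(centers, key=lambda c: squared_distance(vector, centers[c]))
        yield cid, vector

def combine(mapped):
    """Combiner: collapse one mapper's output to a (sum, count) per cluster."""
    partial = {}
    for cid, vector in mapped:
        total, count = partial.get(cid, ((0.0,) * len(vector), 0))
        partial[cid] = (tuple(t + x for t, x in zip(total, vector)), count + 1)
    return list(partial.items())

def reduce_all(partials, centers):
    """Singleton reducer: merge all partial sums into new cluster centers."""
    totals = {}
    for cid, (total, count) in partials:
        acc, n = totals.get(cid, ((0.0,) * len(total), 0))
        totals[cid] = (tuple(a + t for a, t in zip(acc, total)), n + count)
    new_centers = dict(centers)  # clusters that received no points keep their center
    for cid, (acc, n) in totals.items():
        new_centers[cid] = tuple(a / n for a in acc)
    return new_centers

def run_iteration(splits, centers):
    """One iteration job: map and combine per split, then a single reduce."""
    partials = []
    for split in splits:
        partials.extend(combine(map_split(split, centers)))
    return reduce_all(partials, centers)
```

A driver would call `run_iteration` repeatedly until the centers stop moving, then run `map_split` alone over every split to label the points, which corresponds to the mapper-only final job on the slide.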

K-Means Hadoop Implementation
- KMeansDriver
  - runJob()
  - runIteration()
  - isConverged()
  - runCluster()
- KMeansMapper
  - configure()
  - map()
- KMeansCombiner
  - reduce()
- KMeansReducer
- Cluster
  - configure()
  - formatCluster()
  - decodeCluster()
  - addPoint()
  - computeCentroid()
  - accessors
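The slide gives only method names, so the following Python sketch is a guess at what a cluster accumulator with addPoint() and computeCentroid() might look like, plus a convergence check in the spirit of isConverged(). The bodies, the snake_case names, and the `update` helper are assumptions for illustration and do not reproduce the Mahout source.

```python
import math

class Cluster:
    """Hypothetical accumulator: keeps a running sum and count of its points."""

    def __init__(self, cluster_id, center):
        self.cluster_id = cluster_id
        self.center = tuple(center)
        self._sum = [0.0] * len(self.center)
        self._count = 0

    def add_point(self, point):
        """Fold one point into the running sum (cf. addPoint() on the slide)."""
        for i, x in enumerate(point):
            self._sum[i] += x
        self._count += 1

    def compute_centroid(self):
        """Mean of the points seen so far, or the old center if none were added."""
        if self._count == 0:
            return self.center
        return tuple(s / self._count for s in self._sum)

    def update(self, delta):
        """Move the center to the centroid, reset the accumulator, and
        report whether the center moved by at most delta."""
        new_center = self.compute_centroid()
        shift = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(self.center, new_center)))
        self.center = new_center
        self._sum = [0.0] * len(new_center)
        self._count = 0
        return shift <= delta
```

Keeping only a sum and a count means a cluster never has to hold its points in memory, which is one plausible reason for an interface shaped like the one on the slide.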

Algorithms Under Development
- Naïve Bayes
- Perceptron
- PLSI/EM
- Taste Collaborative Filtering Integration
- Genetic Programming
- Dirichlet Process Clustering

GSoC @ Mahout
- Many interesting submissions
- 4 projects approved for Mahout (http://code.google.com/soc/2008/asf/about.html)
  - “Mahout: Parallel implementation of [NB/SOM/RF] machine learning algorithms”, Farid Bourennani
  - “Implementing Logistic Regression in Mahout”, Yun Jiang
  - “Codename Mahout.GA for mahout-machine-learning”, Abdel Hakim Deneche
  - “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion
- This is just the beginning
- High demand for scalable machine learning
- Contributors needed who have
  - Interest, enthusiasm & programming ability
  - Test driven development readiness
  - Comfort with the scary math (or bravery)
  - Interest and/or proficiency with Hadoop
  - Some large data sets you want to analyze
  - Access to clusters that we could use for testing