Apache Mahout: Industrial Strength Machine Learning
Jeff Eastman

Current Situation
- Large volumes of data are now available
- Platforms now exist to run computations over large datasets (Hadoop, HBase)
- Sophisticated analytics are needed to turn data into information people can use
- Active research community and proprietary implementations of “machine learning” algorithms
- The world needs scalable implementations of ML under open license - ASF

Where is ML Used Today
- Internet search clustering
- Knowledge management systems
- Social network mapping
- Taxonomy transformations
- Marketing analytics
- Recommendation systems
- Log analysis & event filtering
- Fraud detection

History of Mahout
- Summer 2007
  - Developers needed scalable ML
  - Mailing list formed
- Community formed
  - Apache contributors
  - Academia & industry
  - Lots of initial interest
- Project formed under Apache Lucene
  - January 25, 2008

Who We Are (so far)
- Grant Ingersoll
- Karl Wettin
- Isabel Drost
- Ted Dunning
- Jeff Eastman
- Dawid Weiss
- Otis Gospodnetic
- Erik Hatcher

Current Code Base
- Matrix & Vector library
  - Hama collaboration for very large arrays
- Clustering
  - Canopy
  - K-Means
  - Mean Shift
- Utilities
  - Distance Measures
  - Parameters
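To make the clustering entries above concrete, here is a minimal single-machine sketch of canopy selection feeding k-means, with a pluggable distance measure. It is plain Python for illustration only, not Mahout's Java/Hadoop implementation; the function names (`canopy_centers`, `k_means`, `euclidean`) and the thresholds `t1` and `t2` are assumptions chosen for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points: List[Point], t1: float, t2: float,
                   distance: Distance = euclidean) -> List[Tuple[Point, List[Point]]]:
    """Cheap pre-clustering pass (illustrative, hypothetical helper).

    Take the first remaining point as a canopy center, collect every
    remaining point closer than the loose threshold t1 as a member, then
    drop every point closer than the tight threshold t2 from further
    consideration. Requires t1 > t2 > 0.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        members = [p for p in remaining if distance(center, p) < t1]
        canopies.append((center, members))
        # The center is at distance 0 from itself, so it is always removed.
        remaining = [p for p in remaining if distance(center, p) >= t2]
    return canopies


def k_means(points: List[Point], centers: List[Point],
            distance: Distance = euclidean,
            max_iter: int = 20) -> Tuple[List[Point], List[List[Point]]]:
    """Lloyd-style k-means: assign each point to its nearest center, then
    move each center to the mean of its members, until nothing changes."""
    centers = [tuple(c) for c in centers]
    clusters: List[List[Point]] = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]  # keep an empty cluster's center
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Using the canopy centers as the initial k-means centers avoids guessing the number of clusters up front; swapping `euclidean` for another function with the same signature changes the distance measure without touching either algorithm.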

Algorithms Under Development
- Naïve Bayes
- Perceptron
- PLSI/EM
- Taste Collaborative Filtering Integration
- Genetic Programming
- Dirichlet Process Clustering
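As one illustration of the algorithms listed above, the following is a minimal sketch of the classic perceptron learning rule. It is a single-machine Python example, not the Mahout implementation under development; the names `train_perceptron` and `predict`, the label encoding (+1 / -1), and the default parameters are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 if the point lies on the positive side of the hyperplane, else -1."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


def train_perceptron(samples: List[Sequence[float]], labels: List[int],
                     epochs: int = 100,
                     learning_rate: float = 1.0) -> Tuple[List[float], float]:
    """Train a linear classifier with the perceptron update rule.

    Labels must be +1 or -1. Whenever a sample is misclassified (or sits
    exactly on the boundary), the weights are nudged toward it. Training
    stops early once a full pass makes no mistakes, which is only
    guaranteed to happen when the data are linearly separable.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias
```

Each update depends only on one sample and the current weights, which is part of what makes this family of algorithms a candidate for large-scale implementations.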

Mahout
- Many interesting submissions
- 4 projects approved for Mahout
  - “Mahout: Parallel implementation of machine learning algorithms”, Farid Bourennani
  - “Implementing Logistic Regression in Mahout”, Yun Jiang
  - “Codename Mahout.GA for mahout-machine-learning”, Abdel Hakim Deneche
  - “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion
- This is just the beginning
- High demand for scalable machine learning
- Contributors needed who have
  - Interest, enthusiasm & programming ability
  - Test driven development readiness
  - Comfort with the scary math (or bravery)
  - Interest and/or proficiency with Hadoop
  - Some large data sets you want to analyze