Redpoll: A machine learning library based on Hadoop. Jeremy, CS Dept., Jinan University, Guangzhou.



Introduction. What is Redpoll? Who will use Redpoll? Motivation: the challenge of large-scale datasets; more practical for mining textual corpora; close to Chinese users; Apache licensed.

Basic principles: decomposition; mappers; reducer. Assume that we have a set of m data points, each of length n.
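The slide gives only the outline, so the following is a minimal sketch of what such a decomposition could look like, not Redpoll's actual code. It assumes the m points are split into chunks, each mapper emits a partial sum and a count for its chunk, and a single reducer combines them into the mean vector. The names `mapper`, `reducer`, and `mean_vector` are hypothetical, and the example runs in plain Python rather than on Hadoop.

```python
from functools import reduce


def mapper(chunk):
    """Emit (partial sum vector, point count) for one chunk of points."""
    n = len(chunk[0])
    partial = [0.0] * n
    for point in chunk:
        for j, value in enumerate(point):
            partial[j] += value
    return partial, len(chunk)


def reducer(left, right):
    """Combine two (sum vector, count) pairs into one."""
    sum_left, count_left = left
    sum_right, count_right = right
    combined = [a + b for a, b in zip(sum_left, sum_right)]
    return combined, count_left + count_right


def mean_vector(points, num_mappers):
    """Split m points across mappers, then reduce to the mean vector."""
    size = -(-len(points) // num_mappers)  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    total, count = reduce(reducer, map(mapper, chunks))
    return [s / count for s in total]


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6], [7, 8]]  # m = 4 points, n = 2
    print(mean_vector(data, num_mappers=2))
```

Because the reducer only adds sums and counts, the result does not depend on how the points are divided among mappers, which is what makes this kind of computation easy to distribute.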

Performance bottlenecks: network bandwidth; I/O speed; algorithm implementations; Hadoop.

Current work: Vector Writable utilities; distance measure utilities; Naive Bayes; Canopy; K-means; an infrastructure for textual data mining; an example of mining Sogou news.

An example: Canopy. Clustering of large, high-dimensional datasets; two different distances; two stages; saves computation; applicable in many domains; works with EM, GAC, and K-means.
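As a rough illustration of the first stage, here is a sketch of canopy creation as it is commonly described, using a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`. The slide does not say whether "two different distances" means two thresholds or two distance measures, so the thresholds, the Euclidean distance, and the function names below are all assumptions, not Redpoll's implementation.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; stands in for a cheap distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_canopies(points, t1, t2, dist=euclidean):
    """Return a list of (center index, member indices) canopies.

    Points within t1 of a center join its canopy (canopies may overlap);
    points within t2 of a center can no longer become centers themselves.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        members = [i for i in range(len(points))
                   if dist(points[center], points[i]) < t1]
        canopies.append((center, members))
        remaining = [i for i in remaining
                     if dist(points[center], points[i]) >= t2]
    return canopies


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0.5, 0), (10, 10), (11, 10)]
    for center, members in build_canopies(pts, t1=3.0, t2=1.5):
        print(pts[center], [pts[i] for i in members])
```

The second stage would then run a more expensive clusterer such as K-means, comparing a point only with centers that share a canopy with it, which is where the computation saving comes from.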

An example: Canopy, cont'd. CanopyDriver; CanopyMapper (input, output); CanopyReducer (output); ClusterDriver and ClusterMapper assign each point to canopies.
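The slide names the classes but not their behavior. One plausible reading, sketched here in plain Python, is that each mapper finds canopy centers in its own input split, the reducer merges those centers into a final set, and a second map pass assigns every point to the canopies whose center lies within the loose threshold. The function names, the thresholds `t1` and `t2`, and the nearest-center fallback are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_centers(points, t2):
    """Greedy pass: a point becomes a center unless it lies within t2
    of a center chosen earlier."""
    centers = []
    for p in points:
        if all(euclidean(p, c) >= t2 for c in centers):
            centers.append(p)
    return centers


def canopy_mapper(split, t2):
    """Stand-in for CanopyMapper: local centers for one input split."""
    return pick_centers(split, t2)


def canopy_reducer(center_lists, t2):
    """Stand-in for CanopyReducer: merge the centers from all mappers."""
    merged = [c for centers in center_lists for c in centers]
    return pick_centers(merged, t2)


def cluster_mapper(point, centers, t1):
    """Stand-in for ClusterMapper: indices of canopies containing point."""
    ids = [i for i, c in enumerate(centers) if euclidean(point, c) < t1]
    if not ids:  # assumed fallback: use the nearest center
        ids = [min(range(len(centers)),
                   key=lambda i: euclidean(point, centers[i]))]
    return ids


if __name__ == "__main__":
    splits = [[(0, 0), (1, 0)], [(0.5, 0), (10, 10)], [(11, 10)]]
    local = [canopy_mapper(s, t2=1.5) for s in splits]
    centers = canopy_reducer(local, t2=1.5)
    for split in splits:
        for p in split:
            print(p, cluster_mapper(p, centers, t1=3.0))
```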

What's next? SVM (Support Vector Machine): fast in training and prediction; optimal hyperplane; kernels; duality; decomposition; parallelization approach.
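For reference, the terms "optimal hyperplane", "kernels", and "duality" can be tied together by the standard soft-margin SVM formulation. The slide does not state which variant is planned, so the textbook form below is an assumption. Training pairs are (x_i, y_i) with labels y_i in {-1, +1}, i = 1..m, C is the penalty parameter, and the kernel is K(x_i, x_j) = \phi(x_i)^\top \phi(x_j).

```latex
% Primal problem: maximum-margin hyperplane with slack variables \xi_i
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0

% Dual problem: depends on the data only through the kernel K
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{m}\alpha_i y_i = 0
```

Decomposition methods solve the dual by optimizing a small working set of the \alpha_i at a time while holding the rest fixed, which is one natural starting point for a parallel approach.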

Planned algorithms: EM (Expectation Maximization); LSI (Latent Semantic Indexing); SVD (Singular Value Decomposition); PCA (Principal Component Analysis); PageRank; KNN (k-Nearest Neighbors); Linear Regression; and so on.

You are welcome to join us! Development; documentation; source code management; suggestions; anything else that can help us.

Check it out!