Redpoll
A machine learning library based on Hadoop
Jeremy
CS Dept., Jinan University, Guangzhou
Introduction
- What is Redpoll?
- Who will use Redpoll?
- Motivation
- Challenges from large-scale datasets
- More practical when mining textual corpora
- Close to us Chinese people
- Apache licensed
Basic Principles
- Decomposition
- Mappers
- Reducer
- Assume that we have a set of m data points, each of length n
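The decomposition above can be sketched in plain Python, without Hadoop: each mapper computes a partial result over its own split of the m data points, and a single reducer combines the partials. The statistic used here (the per-dimension mean) and the names `split`, `mapper`, `reducer`, and `mean_vector` are illustrative assumptions, not part of Redpoll.

```python
from functools import reduce

def split(points, num_splits):
    """Partition the data points into roughly equal contiguous splits."""
    size = max(1, -(-len(points) // num_splits))  # ceiling division
    return [points[i:i + size] for i in range(0, len(points), size)]

def mapper(chunk):
    """Emit a partial (vector sum, count) for one split."""
    n = len(chunk[0])  # each point has length n
    sums = [0.0] * n
    for point in chunk:
        for j, value in enumerate(point):
            sums[j] += value
    return sums, len(chunk)

def reducer(partials):
    """Combine the partial sums from all mappers into the global mean."""
    total, count = reduce(
        lambda a, b: ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1]),
        partials,
    )
    return [x / count for x in total]

def mean_vector(points, num_splits=2):
    # In Hadoop the mappers would run in parallel on different nodes;
    # here they simply run one after another.
    return reducer([mapper(chunk) for chunk in split(points, num_splits)])

points = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # m = 4, n = 2
mean = mean_vector(points)
```

The same pattern applies whenever a computation can be written as a sum over data points: only the small partial results cross the network, not the points themselves.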
Performance Bottlenecks
- Network bandwidth
- I/O speed
- Algorithm implementations
- Hadoop
Current Work
- Vector Writable utils
- Distance Measure utils
- Naive Bayes
- Canopy
- K-means
- An infrastructure for textual DM
- An example for mining Sogou news
An example: Canopy
- Clustering of large, high-dimensional datasets
- Two different distances
- Two stages
- Saves computation
- Applicable to many domains
- EM, GAC, K-means
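A minimal single-machine sketch of the first stage, following the commonly described canopy algorithm rather than Redpoll's actual code: a cheap distance groups points into overlapping canopies, so that the second, more expensive stage (for example K-means) only needs to compare points that share a canopy. The thresholds `t1` and `t2` (with `t1 > t2`), the choice of Manhattan distance as the cheap measure, and all function names are assumptions made for illustration.

```python
def manhattan(a, b):
    """Cheap distance used only to build canopies (an assumed choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_canopies(points, t1, t2, distance=manhattan):
    """Return a list of (center, members) pairs.

    t1 is the loose threshold: points closer than t1 join the canopy.
    t2 is the tight threshold: points closer than t2 are removed from
    the candidate list and can no longer become centers.
    """
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # inside the loose threshold
            if d >= t2:
                still_remaining.append(p)  # outside the tight threshold
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

points = [(0, 0), (1, 0), (5, 0), (6, 0), (3, 0)]
canopies = build_canopies(points, t1=3.5, t2=1.5)
centers = [center for center, _ in canopies]
```

In this toy run the point `(3, 0)` ends up in more than one canopy, which is expected: canopies overlap, and the second stage resolves the final assignment.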
An example: Canopy (cont'd)
- CanopyDriver
- CanopyMapper: input, output
- CanopyReducer: output
- ClusterDriver & ClusterMapper: assign each point to canopies
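One way the classes listed above could fit together is sketched below in plain Python. The division of work is an assumption inferred from the class names, not taken from Redpoll's source: each mapper picks canopy centers from its own split, a single reducer merges those local centers into global ones, and a second map pass assigns every point to each canopy whose center lies within the loose threshold. The functions `canopy_mapper`, `canopy_reducer`, and `cluster_mapper`, the thresholds `T1` and `T2`, and the Manhattan distance are all hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pick_centers(points, t2):
    """Greedy center selection: a point becomes a new center only if it
    is at least t2 away from every center chosen so far."""
    centers = []
    for p in points:
        if all(manhattan(p, c) >= t2 for c in centers):
            centers.append(p)
    return centers

def canopy_mapper(split, t2):
    """Input: the points of one split. Output: local canopy centers."""
    return pick_centers(split, t2)

def canopy_reducer(local_centers, t2):
    """Output: global canopy centers, merged from all mappers."""
    flat = [c for centers in local_centers for c in centers]
    return pick_centers(flat, t2)

def cluster_mapper(split, centers, t1):
    """Emit (canopy_index, point) for every canopy within t1 of the point."""
    pairs = []
    for p in split:
        for i, c in enumerate(centers):
            if manhattan(p, c) < t1:
                pairs.append((i, p))
    return pairs

splits = [[(0, 0), (1, 0), (3, 0)], [(5, 0), (6, 0)]]
T1, T2 = 3.5, 1.5

local = [canopy_mapper(s, T2) for s in splits]   # first job, map phase
centers = canopy_reducer(local, T2)              # first job, reduce phase
assignments = [pair for s in splits              # second job, map only
               for pair in cluster_mapper(s, centers, T1)]
```

With this two-level selection a point is only guaranteed to lie within 2 * T2 of some final center, so the sketch assumes T1 >= 2 * T2 to ensure that every point lands in at least one canopy.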
What's Next?
- SVM (Support Vector Machine)
- Fast in training and prediction
- Optimal hyperplane
- Kernels
- Duality
- Decomposition
- Parallelization approach
Planned Algorithms
- EM (Expectation Maximization)
- LSI (Latent Semantic Indexing)
- SVD (Singular Value Decomposition)
- PCA (Principal Component Analysis)
- PageRank
- KNN (k-Nearest Neighbors)
- Linear Regression
- and so on...
You are welcome to join us!
- Development
- Documentation
- Source code management
- Suggestions
- Anything else that can help us
Check it out!