USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Outline Background & Challenge Preprocessing Lemmatization TF-IDF Classification k-NN Naive Bayes Linear Support Vector Classification Logistic Regression Classification Random Forest Evaluation Conclusion

Background A problem from Kaggle: predict the category of cuisine from the recipe ingredients, e.g. pasta -> Italian, kimchi -> Korean, curry -> Indian.

Challenge Multi-class classification with 6,715 features: if we use a binary indicator for every ingredient in each recipe, the training data becomes very large. A huge number of class labels to train, quite different from a binary 'Yes'/'No' label. Class imbalance: Italian and Indian recipes dominate the data set, while cuisines such as "cajun_creole" appear only rarely.

Outline Background & Challenge Preprocessing Lemmatization TF-IDF Classification k-NN Naive Bayes Linear Support Vector Classification Logistic Regression Classification Random Forest Evaluation Conclusion

Lemmatization Characters that are hard to handle in the data set: ™ and ® – delete them; they do not influence the result. French characters (é, ù) – replace each with a similar English character, making sure the word remains unique among the features after replacement. Plural forms (eggs vs. egg) – use NLTK (Natural Language Toolkit) to lemmatize each word according to the dictionary in the toolkit.
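A minimal sketch of this cleaning step, assuming each ingredient is a plain string (the helper name and exact replacement rules are illustrative, not from the slides, and NLTK's WordNet data must be downloaded first via nltk.download('wordnet')):

```python
# Illustrative cleaning helper: symbol removal, accent replacement, lemmatization.
import re
import unicodedata
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def clean_ingredient(text):
    # Drop trademark/registered symbols; they do not influence the result.
    text = text.replace("\u2122", "").replace("\u00ae", "")
    # Replace accented (e.g. French) characters with similar English letters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lemmatize each word so that plural forms collapse (eggs -> egg).
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatizer.lemmatize(w) for w in words)

print(clean_ingredient("Fresh Eggs\u00ae"))         # fresh egg
print(clean_ingredient("cr\u00e8me fra\u00eeche"))  # creme fraiche
```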

TF-IDF The problem is similar to labeling a document according to its content. Term Frequency–Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The raw term frequency TF(t) is the number of times that term t occurs in the content. After lemmatization and TF-IDF, the number of features is reduced from 6,715 to 2,774.
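A sketch of the TF-IDF step with scikit-learn; the recipe strings below are toy stand-ins for the cleaned competition data:

```python
# Turn each recipe (its cleaned ingredients joined by spaces) into a TF-IDF vector.
from sklearn.feature_extraction.text import TfidfVectorizer

recipes = [
    "basil tomato pasta olive oil",
    "kimchi rice garlic chili",
    "curry cumin lentil rice",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(recipes)   # sparse matrix: recipes x ingredient terms
print(X.shape)                          # (3, number of distinct terms)
print(vectorizer.get_feature_names_out())
```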

Outline Background & Challenge Preprocessing Lemmatization TF-IDF Classification k-NN Naive Bayes Linear Support Vector Classification Logistic Regression Classification Random Forest Evaluation Conclusion

k-NN scikit-learn implements two different nearest-neighbor classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer specified by the user; RadiusNeighborsClassifier implements learning based on the neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. We choose the first classifier and set k = 1; its result is taken as the baseline for all classifiers' performance.
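A sketch of the 1-NN baseline (toy data; in the project the input is the TF-IDF matrix of all recipes):

```python
# Baseline classifier: 1 nearest neighbor on TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

recipes = ["basil tomato pasta olive oil", "kimchi rice garlic chili",
           "curry cumin lentil rice"]
cuisines = ["italian", "korean", "indian"]

vec = TfidfVectorizer()
X = vec.fit_transform(recipes)

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1, as on the slide
knn.fit(X, cuisines)
print(knn.predict(vec.transform(["garlic chili rice kimchi stew"])))  # ['korean']
```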

Naive Bayes The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification); in practice, fractional counts such as TF-IDF may also work. Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution, which in turn helps alleviate problems stemming from the curse of dimensionality. It performed better than expected here: the attributes (ingredients) are relatively independent compared with the word vectors of ordinary text.

Parameters The default smoothing parameter is alpha = 1. We set alpha = 0.01, since the TF-IDF feature values are smaller than 1 and would be swamped by the default amount of smoothing.
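A sketch of the multinomial Naive Bayes classifier with the smoothing value from the slide (toy data again stands in for the real TF-IDF matrix):

```python
# Multinomial Naive Bayes on TF-IDF features with reduced smoothing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

recipes = ["basil tomato pasta olive oil", "kimchi rice garlic chili",
           "curry cumin lentil rice"]
cuisines = ["italian", "korean", "indian"]

vec = TfidfVectorizer()
X = vec.fit_transform(recipes)

nb = MultinomialNB(alpha=0.01)   # default alpha=1.0 smooths the small TF-IDF values too strongly
nb.fit(X, cuisines)
print(nb.predict(vec.transform(["tomato basil pasta"])))  # ['italian']
```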

Linear Support Vector Classification The advantages of support vector machines are: effective in high-dimensional spaces; uses a subset of training points in the decision function (called support vectors), so it is also memory efficient; versatile: different kernel functions can be specified for the decision function (common kernels are provided, but it is also possible to specify custom kernels). The disadvantages of support vector machines include: if the number of features is much greater than the number of samples, the method is likely to give poor performance; SVMs do not directly provide probability estimates, which are instead calculated using an expensive five-fold cross-validation.

Linear Support Vector Classification Multiclass support is handled according to One-Vs-All scheme Radial Basis Function kernel

Parameters Default parameters: the penalty parameter C of the error term is 1.0 and dual = True. We set C = 0.8 and dual = False.
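A sketch of the linear SVC with the parameters from the slide (C = 0.8, dual = False), multiclass support handled one-vs-rest; the toy data stands in for the TF-IDF matrix:

```python
# Linear support vector classification with the slide's parameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

recipes = ["basil tomato pasta olive oil", "kimchi rice garlic chili",
           "curry cumin lentil rice", "mozzarella basil pizza tomato"]
cuisines = ["italian", "korean", "indian", "italian"]

vec = TfidfVectorizer()
X = vec.fit_transform(recipes)

svc = LinearSVC(C=0.8, dual=False)   # parameters chosen on the slide
svc.fit(X, cuisines)
print(svc.predict(vec.transform(["curry lentil rice"])))  # expected: ['indian']
```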

Logistic Regression Classification Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Parameters We use GridSearchCV to find the best parameters.
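The slides do not give the parameter grid, so the search space below is an assumption; the toy data stands in for the TF-IDF features and cuisine labels:

```python
# Grid search over the regularization strength of logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

recipes = ["basil tomato pasta olive oil", "mozzarella basil pizza tomato",
           "kimchi rice garlic chili", "kimchi noodle garlic scallion",
           "curry cumin lentil rice", "curry turmeric chickpea rice"]
cuisines = ["italian", "italian", "korean", "korean", "indian", "indian"]

X = TfidfVectorizer().fit_transform(recipes)

param_grid = {"C": [0.1, 1.0, 10.0]}   # assumed search space, not from the slides
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=2)
search.fit(X, cuisines)
print(search.best_params_, search.best_score_)
```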

Random Forest A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. A diverse set of classifiers is created by introducing randomness in the classifier construction; the prediction of the ensemble is given as the averaged prediction of the individual classifiers.

Parameters By default, the number of trees in the forest is 10; we set it to 100. More trees cover more features: the larger the number, the better, but also the longer it takes to compute.
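A sketch of the random forest with 100 trees (toy data; RandomForestClassifier accepts the sparse TF-IDF matrix directly):

```python
# Random forest with the number of trees set to 100.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

recipes = ["basil tomato pasta olive oil", "kimchi rice garlic chili",
           "curry cumin lentil rice", "mozzarella basil pizza tomato"]
cuisines = ["italian", "korean", "indian", "italian"]

vec = TfidfVectorizer()
X = vec.fit_transform(recipes)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, cuisines)
print(forest.predict(vec.transform(["kimchi garlic rice"])))  # predicted cuisine for a new recipe
```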

Outline Background & Challenge Preprocessing Lemmatization TF-IDF Classification k-NN Naive Bayes Linear Support Vector Classification Logistic Regression Classification Random Forest Evaluation Conclusion

Evaluation Setup: Python 3.3 for Windows, with two libraries: NLTK (Natural Language Toolkit) and scikit-learn. Evaluation metrics: accuracy and running time.
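A rough sketch of how accuracy and running time can be measured together (the helper below is illustrative, not the project's actual evaluation code; clf, X, and y stand for any classifier above, the TF-IDF matrix, and the cuisine labels):

```python
# Measure accuracy and wall-clock training+prediction time for a classifier.
import time
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(clf, X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    start = time.time()
    clf.fit(X_tr, y_tr)
    predictions = clf.predict(X_te)
    elapsed = time.time() - start
    return accuracy_score(y_te, predictions), elapsed

# Example: accuracy, seconds = evaluate(LinearSVC(C=0.8, dual=False), X, y)
```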

Accuracy

Time The running time for 1-NN is longer than 5 hours.

Outline Background & Challenge Preprocessing Lemmatization TF-IDF Classification k-NN Naive Bayes Linear Support Vector Classification Logistic Regression Classification Random Forest Evaluation Conclusion

Conclusion The preprocessing step dramatically saves execution time. Different parameters significantly affect the results. Considering both accuracy and time, Linear SVC is the best choice.