Tutorial for LightSIDE June 4, 2018 Heejun Kim
The workflow for Text Mining
1. Preparing data
2. Extracting features
3. Building model
4. Predicting labels
5. Error analysis
Notes: Examples of data to prepare: movie reviews (positive vs. negative), or whether a figure is a triangle or not. In my case, I am interested in predicting whether information is credible or not. Features use a bag-of-words representation. LightSIDE covers every step except the first.
Installing LightSIDE
- You should have a JRE (1.8 preferred) or a JDK
- Download the zip file linked as “Program: LightSIDE” from the course website
- Unzip the file
- Mac: LightSide.app
- Windows: LightSide.bat
- Linux: run.sh
Notes: JRE stands for Java Runtime Environment. It includes the Java virtual machine that lets you run Java-based programs. Introduce where students can find the manual.
Preparing Data (LightSIDE)
- CSV file (comma-delimited text file)
- One column for text, another column for class
- Additional pre-processed attributes (e.g., length) can be included in additional columns (they are read with the column features extractor)
- Encoding: UTF-8 is recommended
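The file layout above can be sketched in a few lines of Python. This is only an illustration, not part of LightSIDE: the column names text, class, and length, the sample reviews, and the file name reviews.csv are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample data: one text column and one class column.
rows = [
    {"text": "A moving, beautifully shot film.", "class": "positive"},
    {"text": "Dull plot, and the acting is wooden.", "class": "negative"},
]

def write_reviews(rows, handle):
    """Write rows as CSV, adding a pre-processed 'length' column (word count)."""
    writer = csv.DictWriter(handle, fieldnames=["text", "class", "length"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "length": len(row["text"].split())})

# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
write_reviews(rows, buffer)
csv_text = buffer.getvalue()

# To save a real file with the recommended encoding (file name is an assumption):
# with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
#     write_reviews(rows, f)
```

The csv module quotes any text that contains a comma, so review text stays in a single column.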
Preparing Data (LightSIDE)
Open Files
- Open a file
- Select a file
- Check details
Notes: For the simple workflow of the 2nd assignment, you will only need the first, third, and last tabs. So let me go over the core process first, then take some questions and explore more functions later.
Extract Features
- Select extractor
- Configure detailed options
- Execute
- Check performance of features
- Set threshold
Notes: For the second assignment, you are only going to use “Basic Features”. However, for the 3rd assignment, you may want to explore other feature extractors. Use only Unigrams and select the “Skip stopwords in N-Grams” option, except for question #5. LightSIDE allows you to set a threshold on the minimum number of training set instances that must contain a particular feature in order for that feature to make it into the feature representation. If we set t=2, then only terms that appear in at least 2 training set instances make it into the feature representation.
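To make the threshold concrete, here is a minimal sketch of unigram extraction with stopword skipping and a document-frequency threshold. It is not LightSIDE's implementation: the tiny stopword list, the tokenizer, the sample documents, and the choice of binary (present/absent) feature values are all assumptions for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not LightSIDE's list).
STOPWORDS = {"the", "a", "is", "and", "of"}

def unigrams(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def build_vocabulary(documents, threshold):
    """Keep terms that occur in at least `threshold` training instances."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(unigrams(doc)))  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= threshold)

def feature_table(documents, vocabulary):
    """One row per document, one binary column per vocabulary term."""
    table = []
    for doc in documents:
        present = set(unigrams(doc))
        table.append([1 if term in present else 0 for term in vocabulary])
    return table

docs = [
    "The movie is great",
    "A great cast and a great story",
    "The story is dull",
]
vocab = build_vocabulary(docs, threshold=2)
table = feature_table(docs, vocab)
```

With t=2, only "great" and "story" survive, because every other term appears in a single document.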
An example of feature table
Building Model + Predicting Label
- Select a machine learning algorithm (e.g., NaiveBayes, Logistic Regression)
- Evaluation method: independent training/test data or n-fold cross validation for your project
Notes: Only Naïve Bayes algorithm
Building Model + Predicting Label
- Training data + independent test data
- Diagram labels: Training data, Test data
Notes: Training set, validation set, testing set
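The idea of holding out independent test data can be sketched as follows. This is a generic illustration under stated assumptions, not what LightSIDE does internally: the function names holdout_split and accuracy, the 20% test fraction, and the fixed random seed are all hypothetical choices.

```python
import random

def holdout_split(instances, test_fraction=0.2, seed=0):
    """Shuffle the instances and set aside a fraction as independent test data."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of instances whose predicted label matches the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Ten instances identified by index: 8 go to training, 2 to testing.
train, test = holdout_split(range(10), test_fraction=0.2)
```

The model is built on the training portion only, and accuracy is then measured on the test portion, which the model has never seen.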
Building Model + Predicting Label
Cross validation (e.g., 5-fold)
Run    Accuracy
1st    0.78
2nd    0.76
3rd    0.77
4th    0.73
5th    0.79
avg    0.766
Diagram labels: Training data, Test data
Notes: Over-fitting
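As a rough sketch of how the folds and the average fit together: each instance is used as test data in exactly one run and as training data in the others, and the reported figure is the mean of the per-run accuracies. The function name k_fold_indices and the round-robin fold assignment are assumptions for illustration; the five accuracies are the ones from the table.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Round-robin assignment: instance i goes to fold i % k.
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_instances) if i not in held]
        yield train, held_out

# Per-run accuracies taken from the table above.
fold_accuracies = [0.78, 0.76, 0.77, 0.73, 0.79]
average = sum(fold_accuracies) / len(fold_accuracies)

splits = list(k_fold_indices(10, 5))
```

Rounded to three decimals, the average matches the 0.766 shown in the table.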
Build Models
- Configure detailed options
- Select algorithm
- Select a feature table
- Select evaluation option
- Execute
- Check performance of prediction
Notes: Use only the Naïve Bayes algorithm. The detailed options may be appropriate when working with numeric feature values, but are generally unimportant; in some cases, though, these configurations do matter. Sometimes train.csv is used for both training and testing; other times, train.csv is used for training and test.csv for testing. Show the table in the assignment and explain what training set accuracy and testing set accuracy are.
Explore Results
- Select case
- Select a model
- Select a feature
- Select matrix
- Select exploration method
- Check the case
- Click a case
Predict Labels
- Choose a file you want to make predictions for
- Export results
- Select a model
- Check results
Any questions?