Classification CS5604 Information Storage and Retrieval - Fall 2016


Classification CS5604 Information Storage and Retrieval - Fall 2016 Virginia Polytechnic Institute and State University Blacksburg, Virginia 24061 Professor: E. Fox Presenters: Saurabh Chakravarty, Eric Williamson December 1, 2016

Table of Contents Problem Definition High-Level Architecture Data Retrieval and Processing Classification Experimental Results Conclusion and Future Work

Problem Statement Given the tweets in the GETAR and IDEAL collections and a set of real world events, determine which tweets belong to each real world event.

Problem Statement Real world events and example tweet collections: Hurricane Sandy; Hurricane Arthur (188: #Arthur); Fairdale Tornado (182: #tornado, 632: fairdale); Manhattan Explosion (174: #Manhattan)

High-level architecture The classification pipeline consists of a training pipeline and a prediction pipeline.

TRAINING PIPELINE Training data -> Pre-processing (clean / lemmatize / remove stop words) -> Train word vectors -> Word2Vec Model -> Train Classifier -> Classifier Model

PREDICTION PIPELINE <<table>> ideal-cs5604f16 -> Pre-processing (clean / lemmatize / remove stop words) -> Load Word2Vec & Classifier models -> Predict Tweet Event

Data Retrieval Challenges Large amount of data (1.5 billion tweets); serial execution must be avoided; the data cannot fit into memory; avoid reprocessing data unnecessarily.

Data Retrieval from HBase
Spark HadoopRDD: loads the data into the driver and parallelizes it across the cluster. Smaller collections: seamless reading from HBase. Larger collections: hangs and does not complete reading on collections of more than one million tweets.
Batch Processing: loads one batch at a time onto the driver and parallelizes it across the cluster. Smaller collections: slower reading due to batch overhead. Larger collections: allows classification of arbitrarily large collections.
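One way to realize the batch-processing approach is sketched below in Python; this is a minimal illustration, not the team's actual code, and it assumes the HBase Thrift gateway is reachable through the happybase library. The host name, batch sizes, and the process_row placeholder are assumptions; only the ideal-cs5604f16 table name comes from the slides.

```python
# Minimal sketch: pull a batch of rows from HBase onto the driver, then
# parallelize each batch across the Spark cluster for processing.
import happybase
from pyspark import SparkContext

def process_row(row):
    """Stand-in for the real cleaning + classification of one HBase row."""
    row_key, columns = row
    return row_key

sc = SparkContext(appName="tweet-batch-read")
connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift host
table = connection.table("ideal-cs5604f16")

BATCH_SIZE = 100000   # rows pulled onto the driver before each parallel pass
batch = []
for row_key, columns in table.scan(batch_size=1000):
    batch.append((row_key, columns))
    if len(batch) >= BATCH_SIZE:
        sc.parallelize(batch).map(process_row).count()   # distribute and process this batch
        batch = []
if batch:
    sc.parallelize(batch).map(process_row).count()       # process the final partial batch
```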

Challenges with tweets Abbreviations and slang (u, omg); non-English text; URLs and special characters; misspellings. Example: RT: @AssociationsNow A Year After Texas Explosion Federal Repourt Outlines Progress on Fertilize... http://t.co/8fDbMu9asU #meetingprofs

Text-Preprocessing Remove URLs Remove the # characters Lemmatization using StanfordNLP Stopword removal

Cleaning Example Raw: RT: @AssociationsNow A Year After Texas Explosion Federal Report Outlines Progress on Fertilize... http://t.co/8fDbMu9asU #meetingprofs Clean: year texas explosion federal report outline progress fertilize meetingprof
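A minimal sketch of a cleaning function along these lines is shown below; it uses simple regular expressions and a tiny illustrative stopword list, and it omits lemmatization (the actual pipeline used StanfordNLP for that), so plural forms such as "outlines" survive.

```python
# Sketch of tweet cleaning: strip URLs, mentions, '#' characters, punctuation,
# lowercase, and drop a small illustrative set of stop words.
import re

STOP_WORDS = {"a", "an", "the", "on", "of", "after", "rt"}  # illustrative subset

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+", " ", text)      # remove URLs
    text = re.sub(r"@\w+", " ", text)         # remove mentions
    text = text.replace("#", " ")             # drop '#' but keep the tag word
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep alphabetic tokens only
    tokens = [t.lower() for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("RT: @AssociationsNow A Year After Texas Explosion "
                  "Federal Report Outlines Progress on Fertilize... "
                  "http://t.co/8fDbMu9asU #meetingprofs"))
# -> "year texas explosion federal report outlines progress fertilize meetingprofs"
# (lemmatization would further reduce "outlines" -> "outline", "meetingprofs" -> "meetingprof")
```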

Training Data Generation Random samples of collections corresponding to real world events. Each tweet was assigned the number of the class it belongs to most. 1,000 tweets were hand-labeled.

Class distribution for training and test data

Text Classification Feature selection Feature representation Choice of classifier

A comparison of feature selection techniques
Tf-idf. Advantages: superior for small feature sets; high term removal capability. Disadvantages: accuracy suffers for large datasets.
Mutual information. Advantages: simple to implement. Disadvantages: inferior accuracy performance.
Association rules. Advantages: fast execution; very good accuracy for multi-class scenarios; easy to interpret the rules. Disadvantages: prone to discovering too many rules, or poorly understandable rules, that hurt performance and interpretation.
Chi-square statistic. Advantages: robust accuracy and performance with large sample sets with fewer classes. Disadvantages: difficult to interpret when there are a large number of classes.
Within class popularity. Advantages: identifies words that are most discriminative. Disadvantages: ignores the sequence of words.
Word2Vec. Advantages: captures relationships of a word with its neighbors. Disadvantages: high computational complexity; long training time for large sample sizes.
For more details, please refer to the appendix.

Common feature representation techniques One-hot encoding; bag of words. Challenges: large number of dimensions; word relationships with neighbors are not captured.
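The dimensionality problem is easy to see with a toy bag-of-words example; the snippet below (an illustration, not part of the original work) uses scikit-learn's CountVectorizer, so each distinct word becomes one dimension and word order is lost.

```python
# Bag-of-words illustration: dimensionality grows with the vocabulary,
# and neighboring-word relationships are discarded.
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "hurricane sandy hits the coast",
    "tornado touches down in fairdale",
    "explosion reported in manhattan",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

print(X.shape)                          # (3, number_of_distinct_words)
print(sorted(vectorizer.vocabulary_))   # one dimension per distinct word
# Word order is discarded: "sandy hits" and "hits sandy" produce identical vectors.
```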

Word2Vec A feature selection technique that captures the semantic context of a word's relationships with its neighbors. For more details, refer to the appendix.

Word2vec Similar words are grouped together and closer to one another. Slide courtesy - http://files.meetup.com/12426342/5_An_overview_of_word2vec.pdf

Word2vec Relationships between words correspond to displacements (vector differences) between their word vectors.
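A quick way to see this displacement property (purely illustrative, not from the slides) is with pretrained vectors from the gensim downloader; the model name and the need to download it on first use are assumptions.

```python
# "Relationships are displacements": king - man + woman should land near "queen".
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the model on first use

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```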

Word2vec We don’t care about the neural network and the output. We just need the hidden layer representation of the word vectors. Slide courtesy - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2vec We don’t care about the neural network and the output. We just need the hidden layer representation of the word vectors. Slide courtesy - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Classifier – Implementation details The feature vector for a tweet is the average of the word vectors generated by the Word2Vec model. We used a vector representation with a default of 100 dimensions. We chose multi-class logistic regression in the Spark framework to perform classification. The classifier outputs the predicted class along with the normalized probabilities of the other classes.
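A minimal sketch of this setup in PySpark is shown below, assuming cleaned, labeled tweets in a DataFrame; the column names and toy data are assumptions, while the 100-dimension vectors and multinomial logistic regression mirror the description above. Spark's Word2Vec transform averages the word vectors of a document, which matches the averaging step described here.

```python
# Sketch: tokenize cleaned tweets, learn 100-dimensional word vectors,
# average them per tweet (Spark's Word2Vec transform does this), and
# train a multinomial logistic regression classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("tweet-classification").getOrCreate()

train = spark.createDataFrame(
    [(0.0, "hurricane sandy hit new york"),
     (1.0, "fairdale tornado destroy home"),
     (2.0, "manhattan building explosion reported")],
    ["label", "clean_text"])

tokenizer = Tokenizer(inputCol="clean_text", outputCol="words")
word2vec = Word2Vec(vectorSize=100, minCount=0, inputCol="words", outputCol="features")
lr = LogisticRegression(family="multinomial", maxIter=100)

model = Pipeline(stages=[tokenizer, word2vec, lr]).fit(train)
model.transform(train).select("clean_text", "prediction", "probability").show(truncate=False)
```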

Experiments Effect of preprocessing Accuracy performance Runtime performance Probability distribution Class assignment

Cleaning Experiment Determine how cleaning the data influences accuracy Cleaning: Lemmatization Stopword removal Hashtag removal Experimental setup: Split hand-labeled data: 70% train 30% test

Cleaning improves accuracy! Word2Vec with logistic regression: 29% fewer misclassifications Association Rules: 51% fewer misclassifications

Accuracy Experiment Determine which classifier gives better results on labeled data Experimental setup: generate 10 different splits of the labeled data; calculate metrics for each classifier on the same split; 9 different classes.

Accuracy Experimental Setup The labeled data is divided into 10 sets: 7 sets are used for training and 3 sets for testing. The sets are rotated to generate 10 different training and test splits.
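A small sketch of generating these rotated splits (assuming the labeled tweets fit in a Python list and a simple round-robin fold assignment is acceptable) follows.

```python
# Sketch: split labeled data into 10 folds, use 7 for training and 3 for
# testing, and rotate the folds to produce 10 different train/test splits.
def rotated_splits(labeled_tweets, n_folds=10, n_train_folds=7):
    folds = [labeled_tweets[i::n_folds] for i in range(n_folds)]
    for rotation in range(n_folds):
        ordered = folds[rotation:] + folds[:rotation]
        train = [t for fold in ordered[:n_train_folds] for t in fold]
        test = [t for fold in ordered[n_train_folds:] for t in fold]
        yield train, test

data = [(f"tweet {i}", i % 9) for i in range(1000)]  # toy (text, class) pairs
for split_id, (train, test) in enumerate(rotated_splits(data)):
    print(split_id, len(train), len(test))           # 700 training, 300 test per split
```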

Word2vec outperforms association rules Word2Vec with logistic regression had a 6.7% increase in F1 score over association rules (relative error reduction computed as (LR - AR) / (1 - AR)).

Classifier Runtime Performance Need to be able to handle large collections efficiently to classify all of the tweets Classify at a rate faster than the tweets coming in Allow reruns as more classes are added to the training set

Classification Runtime Performance

Logistic Regression Model Optimization HBase read -> partition and distribute -> broadcast the Word2Vec and Logistic Regression models to each partition -> each partition (Partition 1, 2, 3, ...) cleans, classifies, and writes its tweets in parallel.
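The sketch below illustrates the broadcast-and-mapPartitions pattern in PySpark; the toy word-vector table and linear class weights are stand-ins for the real Word2Vec and logistic regression models, so everything except the pattern itself is an assumption.

```python
# Sketch: broadcast the trained models once, then let every partition clean,
# featurize (average word vectors), and classify its own rows in parallel.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classify-partitions").getOrCreate()
sc = spark.sparkContext

# Toy stand-ins for the trained models: word -> vector table and per-class weights.
word_vectors = {"hurricane": np.array([1.0, 0.0]), "sandy": np.array([0.9, 0.1]),
                "tornado": np.array([0.0, 1.0]), "fairdale": np.array([0.1, 0.9])}
class_weights = {"hurricane_sandy": np.array([1.0, 0.0]),
                 "fairdale_tornado": np.array([0.0, 1.0])}

word_vectors_b = sc.broadcast(word_vectors)    # shipped once to each executor
class_weights_b = sc.broadcast(class_weights)

def classify_partition(rows):
    vecs = word_vectors_b.value
    weights = class_weights_b.value
    for row_key, text in rows:
        words = [w for w in text.split() if w in vecs]
        feature = np.mean([vecs[w] for w in words], axis=0) if words else np.zeros(2)
        label = max(weights, key=lambda c: float(weights[c] @ feature))
        yield row_key, label

tweets = sc.parallelize([("row1", "hurricane sandy hits the coast"),
                         ("row2", "tornado touches down in fairdale")], numSlices=2)
print(tweets.mapPartitions(classify_partition).collect())
```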

Processing across partitions improves runtime performance! 57% faster than the original Word2Vec pipeline; 14% faster than Association Rules.

Probability Distribution for Test Data

Multi-Class Assignment Distribution for Test Data

Conclusion Reading data in blocks from HBase and then partitioning it into parallel tasks yields large runtime performance gains and predictable behavior. Cleaning text based on the English usage nuances of the Twitter universe results in better accuracy. Feature selection methods like Word2Vec, which capture richer word semantics and context, yield better accuracy than traditional methods for text classification. It is natural for a tweet to be classified into multiple classes, and the tradeoff between precision and recall depends on user/product requirements.

Future work The system can be retrained on a bigger corpus to generate a newer set of word vectors. Training on a text corpus like Google News can help generate word vectors with richer word relationships encoded within, which can improve classification accuracy. The Logistic Regression classifier can be retrained on new classes. The system will be configured to run periodically via a cron job, ensuring that new tweets are classified, in batches, as they come in. In addition to classifying a tweet, the system also emits the probabilities of all classes; these could be saved in HBase and used by SOLR or the front-end team as a criterion for customizing indexing or the user experience. The results of the developed classifier can be compared with those of the AR classifier and a few other classifiers, and an inter-classifier agreement analysis can throw further light on the efficacy of the developed classifier.

Acknowledgements We would like to acknowledge and thank the following for assisting and supporting us throughout this project. Dr. Edward Fox, Dr. Denilson Alves Pereira NSF grant IIS - 1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR) NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) Digital Library Research Laboratory Graduate Research Assistant – Sunshin Lee Other teams in CS 5604

Appendix

Feature selection techniques
Tf-idf. Advantages: superior for small feature sets that have a large scatter of features among the classes; high term removal capability. Disadvantages: accuracy suffers for large data sets where term distribution alone does not suffice for class discrimination.
Mutual information. Advantages: simple to implement. Disadvantages: inferior performance in estimating probabilities because of bias.
Association rules. Advantages: fast execution; very good accuracy for multi-class scenarios; a rule-based classifier makes the classification decision easy to understand. Disadvantages: prone to discovering too many rules, or poorly understandable rules, that hurt performance and interpretation.
Chi-square statistic. Advantages: robust accuracy and performance with large sample sets with fewer classes. Disadvantages: difficult to interpret when there are a large number of classes.
Within class popularity. Advantages: identifies words that are most discriminative. Disadvantages: ignores the sequence of words.
Word2Vec. Advantages: captures relationships of a word with its neighbors. Disadvantages: high computational complexity; long training time for large sample sizes.

Word2Vec Slide courtesy - http://files.meetup.com/12426342/5_An_overview_of_word2vec.pdf

Word2vec Word vectors can be learned in two forms. CBOW: given a context, predict the word. Slide courtesy - http://files.meetup.com/12426342/5_An_overview_of_word2vec.pdf

Word2vec Skip-gram: the inverse objective of CBOW; given a word, predict the context (its neighbors). Slide courtesy - http://files.meetup.com/12426342/5_An_overview_of_word2vec.pdf
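For reference, both variants can be trained with gensim by toggling the sg flag; the snippet below is an illustrative sketch with a toy corpus and assumed hyperparameters (gensim 4.x naming: vector_size rather than size).

```python
# Sketch: train CBOW and skip-gram Word2Vec models on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["hurricane", "sandy", "hit", "new", "york"],
    ["fairdale", "tornado", "destroy", "home"],
    ["manhattan", "building", "explosion", "reported"],
]

# sg=0 selects CBOW (predict the word from its context);
# sg=1 selects skip-gram (predict the context from the word).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["tornado"][:5])                     # first few dimensions of a learned vector
print(skipgram.wv.most_similar("tornado", topn=2))
```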

Real World Events Used for Experiments The real world events along with the number of tweets labeled as each class in our experimental sets: Hurricane Sandy (108 tweets), Hurricane Isaac (83 tweets), New York Firefighter Shooting (58 tweets), Kentucky Accidental Child Shooting (16 tweets), Newtown School Shooting (157 tweets), Manhattan Building Explosion (189 tweets), China Factory Explosion (178 tweets), Texas Fertilizer Explosion (120 tweets), Hurricane Arthur (169 tweets).

Real World Events Classified The real world events classified, along with the collections in which tweets of each event are found:
Hurricane Sandy: 23, 27, 375
Hurricane Isaac: 27, 28, 375
New York Firefighter Shooting: 43, 46
Kentucky Accidental Child Shooting: 45, 46
Newtown School Shooting: 41, 42, 46
Manhattan Building Explosion: 173, 174, 399, 400
China Factory Explosion: 231, 232
Texas Fertilizer Explosion: 77, 381
Hurricane Arthur: 27, 187, 188, 375
Quebec Train Derailment: 96, 98, 381
Fairdale Tornado: 406, 632
Oklahoma Tornado: 406, 84
Mississippi Tornado: 406, 528
Alabama Tornado: 406, 407