Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:

Slides:



Advertisements
Similar presentations
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
Advertisements

1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Introduction to Data Mining Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.
Systems Analysis I Data Flow Diagrams
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
Ch5 Mining Frequent Patterns, Associations, and Correlations
Qatar Content Classification Presenter Mohamed Handosa VT, CS6604 May 6, 2014 Client Tarek Kanan 1.
Chapter 1 Introduction to Data Mining
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
Introduction to SQL Server Data Mining Nick Ward SQL Server & BI Product Specialist Microsoft Australia Nick Ward SQL Server & BI Product Specialist Microsoft.
Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li Presented by: Rick Knowles 7 April 2005.
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
1 COMP3503 Semi-Supervised Learning COMP3503 Semi-Supervised Learning Daniel L. Silver.
Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,
Tweets Metadata May 4, 2015 CS Multimedia, Hypertext and Information Access Department of Computer Science Virginia Polytechnic Institute and State.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Artificial Intelligence 8. Supervised and unsupervised learning Japan Advanced Institute of Science and Technology (JAIST) Yoshimasa Tsuruoka.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
Bing LiuCS Department, UIC1 Chapter 8: Semi-supervised learning.
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
Active learning Haidong Shi, Nanyi Zeng Nov,12,2008.
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
October 2-3, 2015, İSTANBUL Boğaziçi University Prof.Dr. M.Erdal Balaban Istanbul University Faculty of Business Administration Avcılar, Istanbul - TURKEY.
Text Document Categorization by Term Association Maria-luiza Antonie Osmar R. Zaiane University of Alberta, Canada 2002 IEEE International Conference on.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
1 Introduction to Data Mining C hapter 1. 2 Chapter 1 Outline Chapter 1 Outline – Background –Information is Power –Knowledge is Power –Data Mining.
Algorithms Emerging IT Fall 2015 Team 1 Avenbaum, Hamilton, Mirilla, Pisano.
Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.
Data Analytics CMIS Short Course part II Day 1 Part 1: Clustering Sam Buttrey December 2015.
A Decision Support Based on Data Mining in e-Banking Irina Ionita Liviu Ionita Department of Informatics University Petroleum-Gas of Ploiesti.
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
Data Resource Management – MGMT An overview of where we are right now SQL Developer OLAP CUBE 1 Sales Cube Data Warehouse Denormalized Historical.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
Big Data Processing of School Shooting Archives
Big Data & Test Automation
Classification CS5604 Information Storage and Retrieval - Fall 2016
Sentiment and Topic Analysis
CS6604 Digital Libraries Global Events Team Final Presentation
Collection Management Webpages
Presenters: Radha Krishnan, Sneha Mehta April 28, 2016
IDEALvr Team: Luciano Biondi, Omavi Walker, Dagmawi Yeshiwas
Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Collection Management
Introduction to Data Science Lecture 7 Machine Learning Overview
CLA Team Final Presentation CS 5604 Information Storage and Retrieval
Clustering and Topic Analysis
Virginia Tech Blacksburg CS 4624
Clustering tweets and webpages
CS 5604 Information Storage and Retrieval
כריית מידע -- מבוא ד"ר אבי רוזנפלד.
Collection Management Webpages Final Presentation
Presented by: Prof. Ali Jaoua
Tutorial for LightSIDE
Information Storage and Retrieval
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Michael Shuffett Virginia Tech Blacksburg, VA
Discriminative Frequent Pattern Analysis for Effective Classification
Katrina Database SearchKat
iSRD Spam Review Detection with Imbalanced Data Distributions
Adam Lech Joseph Pontani Matthew Bollinger
Classification and Prediction
Naïve Bayes Classifiers
Market Basket Analysis and Association Rules
Information Retrieval
Austin Karingada, Jacob Handy, Adviser : Dr
Presentation transcript:

Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor: E. Fox Presenters: Hossameldin Shahin Matthew Bock Michael Cantrell May 3rd, 2016

Agenda Background Problem Statement Requirements and Design Implementation Details Evaluation Conclusion and future work

Background Classification is the process of determining which predefined class or set of classes a document can be sorted into. In our case: Classes based on IDEAL project collections Binary classification (relevant or nonrelevant to the collection)

Problem Statement For every tweet in a collection of tweets, determine the likelihood that tweet is actually relevant to the collection Twitter hashtags do a good job of naturally filtering tweets Still a significant amount of spam or otherwise irrelevant tweets Write this probability to a column in HBase for use by other teams in their systems

Problem Statement @JoelOsteen: There will be storms in life but they are only temporary. Praying for our friends in Houston and Texas. #HoustonFlood Relevance probability: ~0.92 #houstonflood #cambodia #GetStupid Watched a video *Car insurance quotes online* GREAT! Here: https://t.co/zChGDwQ8g9 http://t.co/nmq8Eiq8Np Relevance probability: ~0.08

Problem Statement

Requirements and Design Feed cleaned text from Collection Management team into FPM process Run tweet data through our process (described in more detail later) Write probabilities back to Classification:Relevance column HBase for other teams to use

Implementation Details Spark v1.6 IPython notebook Cluster vs. VM (Thanks to IT Security Lab at VT) Training data preparation End-To-End workflow Logistic regression classifier

Prepare Training data To train a binary classifier we need to prepare a training data Manual labeling: Requires a huge work Semi-Automated Query Solr for positive data set Use Frequent Pattern Mining to discover frequent itemsets Automated: Not feasible

Frequent Pattern Mining FPM is an algorithm for frequent itemset mining and association rule learning over transactional databases. FPM has two steps: Identify the frequent individual items Extend them to larger and larger item sets as long as those item sets appear sufficiently often in the database (minimum support)

Frequent Pattern Mining (Example) → → → Minimum Support = 3

Frequent Pattern Mining (#GermanWings)

Workflow

Which Classifier to use? How much training data do you have? None Manually written rules Amount of work required is huge Hand crafting rules produce high accuracy Very little High bias models (e.g. Naïve Bayes) Semi-supervised training methods (Boosting, EM) (Ch-16) Reasonable Best situation! Can use any model Huge The most accurate results Expensive and impractical ( SVMs train time or kNN test time) Naïve Bayes might become the best choice again!

Why Logistic Regression? Current Spark Architecture:

Evaluation Constructed 7 Surveys for manual labelling 675 tweets labelled from wdbj7shooting collection Ideally 3 responses per 100 tweets 16/21 responded to surveys All but one survey had multiple responses Some didn’t label all tweets on form

Survey Results Classifier marks more as Relevant than labelled results Very few marked as Non-Relevant

Conclusion and Future Work Evaluation shows that classifier accuracy could be improved, but trends similarly to manually labelled results More structured evaluation should be performed to validate results of initial evaluation Relevance thresholds may need to be determined for each collection Our design will make it easy to apply our system to other collections in the future First priority for future work is to put more time into web page processing Our system should cover web pages with minimal modifications, but we have not done much testing

Acknowledgments NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL). Dr. Fox GRAs Mohamed Magdy Farag Sunshin Lee Other teams

Thank you! Questions?