Text Classification Seminar Social Media Mining University UC3M


Date: May 2017. Lecturer: Carlos Castillo, http://chato.cl/ Sources: CS124 slides by Dan Jurafsky; slides by Muhammad Atif Qureshi & Arjuman Younus, 2017.

Facebook study (comments and timeline posts). Burke, Moira, Lada A. Adamic, and Karyn Marciniak. "Families on Facebook." In ICWSM, 2013. Featured in a blog post by M. Burke.

Example applications: the disputed "Federalist Papers" in the USA; Gmail smart folders. Mosteller, Frederick, and David L. Wallace. "Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers." Journal of the American Statistical Association 58, no. 302 (1963): 275-309.

[Figure: per-document frequency of use of the word "you" in fiction documents, male vs. female authors.] "Even in formal writing, female writing exhibits greater usage of features identified by previous researchers as 'involved' while male writing exhibits greater usage of features which have been identified as 'informational'." Argamon, S., Koppel, M., Fine, J. and Shimoni, A.R., 2003. Gender, genre, and writing style in formal written texts. TEXT 23(3), pp. 321-346.

Positive or negative review? Given a text, determine if the author is praising or complaining about a monument / landmark http://mashable.com/2015/01/09/one-star-yelp-historical-landmarks/

Academic articles can be assigned to categories such as: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …

Text classification problems.
Generic documents → topics, keywords, …
Generic documents → author age, author gender, …
Generic documents → language
Messages → folder(s), priority, spam?, …
Usual approach: supervised learning methods.

Learning on text. The most obvious mapping is: each document is an input element, and each word is a possible feature. This yields huge dimensionality (on the order of hundreds of thousands of words), so sparse representations are needed.
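A minimal sketch of this mapping: each document becomes a dictionary that stores only the words it actually contains, which is a sparse representation. The function name and tokenization (lowercased whitespace splitting) are illustrative choices, not part of the slides.

```python
from collections import Counter

def to_sparse_bow(text):
    """Map a document to a sparse bag-of-words: only words that occur
    are stored, instead of a dense vector over the full vocabulary."""
    tokens = text.lower().split()
    return dict(Counter(tokens))

doc = "the cat sat on the mat"
print(to_sparse_bow(doc))  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

Even with a vocabulary of hundreds of thousands of words, each document's dictionary holds only the handful of terms it uses.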

Determining features.
Apply the pre-processing pipeline used in search.
Join tokens when needed (e.g., "AK-48", part numbers, chemical formulas, etc.).
You may need to emphasize words in titles, abstracts, or section headings. One option: multiply the input dimensionality by the number of existing blocks (treating "embryo" in a title as completely unrelated to "embryo" in the body). Another option: heuristically increase the weight of words in titles and section headers.
Term frequency is not very informative for short messages.
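The second option above (heuristically boosting title words) can be sketched as follows; the weight of 3 and the function names are illustrative assumptions, not values from the slides.

```python
from collections import Counter

TITLE_WEIGHT = 3  # heuristic boost for title tokens (illustrative value)

def weighted_features(title, body, title_weight=TITLE_WEIGHT):
    """Build term weights where each title token counts `title_weight`
    times as much as a body token."""
    feats = Counter()
    for tok in body.lower().split():
        feats[tok] += 1
    for tok in title.lower().split():
        feats[tok] += title_weight
    return dict(feats)

f = weighted_features("Embryo development", "the embryo grows")
print(f["embryo"])  # 4: once in the body (+1) and once in the title (+3)
```

The first option (separate feature spaces per block) would instead prefix each token with its block, e.g. `title:embryo` vs. `body:embryo`.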

Training data is essential. SVMs and random forests are popular choices; with very little training data, Naïve Bayes (however, I would say: just get more training data). The amount of training data will vary during the learning cycle. In practice: with a few hundred examples per class, you already see obvious examples classified correctly; with a few thousand examples per class, less common cases start to be classified correctly.
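To make the Naïve Bayes option concrete, here is a minimal multinomial Naïve Bayes with add-one (Laplace) smoothing, trained on a tiny made-up spam/ham set; the data and function names are invented for illustration and this is a sketch, not a tuned system.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes from (tokens, label) pairs."""
    class_docs = Counter()                 # documents per class (for priors)
    word_counts = defaultdict(Counter)     # word counts per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, sum(class_docs.values())

def predict_nb(model, tokens):
    """Pick the class maximizing log prior + smoothed log likelihoods."""
    class_docs, word_counts, vocab, n = model
    best, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / n)
        total = sum(word_counts[label].values())
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

train = [(["cheap", "pills", "buy"], "spam"),
         (["meeting", "agenda", "notes"], "ham"),
         (["buy", "cheap", "now"], "spam"),
         (["notes", "from", "meeting"], "ham")]
model = train_nb(train)
print(predict_nb(model, ["cheap", "buy"]))  # spam
```

With only a handful of examples per class this already separates the obvious cases, consistent with the point above that little data favors Naïve Bayes.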

The devil is in the details. Real systems may combine automatic classification with a few carefully hand-crafted rules. Real systems continuously incorporate new examples to maintain and improve performance. Commonly you have unbalanced classes: you need many examples of the minority class, and can obtain them by keyword filtering, but that biases the training data (which harms generative models).

Evaluating. Evaluation can be done on a hold-out set. If more data keeps becoming available, how do we know our classifier is performing better? Two options: cross-validation, or a fixed assignment of examples to the test or hold-out (validation) set.

Cross-validation. Divide the sample into n "folds" (5 in this example). For k = 1 … n: train on all folds except fold k, then test on fold k. Average the n runs to obtain the result.
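The loop above can be sketched generically; the scorer used below (a majority-class baseline on toy data) is invented purely to make the example runnable.

```python
def k_fold_cv(items, k, train_and_test):
    """k-fold cross-validation: split items into k folds, train on k-1
    folds, test on the held-out fold, and average the k scores.
    `train_and_test(train_items, test_items)` must return a score."""
    folds = [items[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_test(train, test))
    return sum(scores) / k

# Toy scorer: fraction of test items matching the majority label in train.
def majority_baseline(train, test):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

data = [(i, "a" if i % 3 else "b") for i in range(12)]
print(round(k_fold_cv(data, 4, majority_baseline), 2))  # 0.67
```

Every item is used for testing exactly once, which makes better use of scarce labeled data than a single fixed split.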

With unbalanced classes, accuracy becomes meaningless; you need to analyze the confusion matrix instead. Example: the classes are { uk, poultry, …, trade }.
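A confusion matrix is just a table of (true label, predicted label) counts; a minimal sketch on invented labels from the example classes:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (true, predicted) label pairs. With unbalanced classes this
    table shows where errors concentrate, which accuracy hides."""
    return Counter(zip(y_true, y_pred))

true = ["uk", "uk", "poultry", "trade", "trade", "trade"]
pred = ["uk", "trade", "poultry", "trade", "trade", "uk"]
cm = confusion_matrix(true, pred)
print(cm[("trade", "trade")])  # 2 correct "trade" predictions
print(cm[("uk", "trade")])     # 1 "uk" document misclassified as "trade"
```

Off-diagonal cells like `("uk", "trade")` are exactly the errors a single accuracy number would obscure.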

Micro- and Macro-Average. Micro-average: pool every item's decisions across all classes, then compute the metric once. Macro-average: compute the metric for each class separately, then average across classes.
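The distinction can be made concrete with precision; the helper names and toy labels below are illustrative. Micro-averaging pools true/false positives over all classes, while macro-averaging gives each class equal weight regardless of its size.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives for one class (one-vs-rest)."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    return tp, fp

def micro_macro_precision(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, c) for c in labels]
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    micro = tp / (tp + fp)  # pool counts, compute once
    macro = sum(c[0] / (c[0] + c[1]) if c[0] + c[1] else 0.0
                for c in counts) / len(labels)  # per class, then average
    return micro, macro

true = ["a", "a", "a", "a", "b"]
pred = ["a", "a", "a", "b", "b"]
micro, macro = micro_macro_precision(true, pred)
print(micro, macro)  # micro = 4/5 = 0.8; macro = mean(3/3, 1/2) = 0.75
```

Note how the rare class "b" pulls the macro-average down: macro-averaging exposes poor minority-class performance that micro-averaging (dominated by the majority class) smooths over.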

Micro- and Macro-Average (cont.)