Bayesian Models, Prior Knowledge, and Data Fusion for Monitoring Messages and Identifying Authors. Paul Kantor Rutgers May 14, 2007.

Outline
– The Team
– Bayes’ Methods
– Method of Evaluation
– A Toy Example
– Expert Knowledge
– Efficiency Issues
– Entity Resolution
– Conclusions

Many Collaborators
Principals: Fred Roberts, David Madigan, David D. Lewis, Paul Kantor
Programmers: Vladimir Menkov, Alex Genkin
Now Ph.D.s: Suhrid Balakrishnan, Dmitriy Fradkin, Aynur Dayanik, Andrei Anghelescu
REU Students: Ross Sowell, Diana Michalek, Jordana Chord, Melissa Mitchell

Overview of Bayes
Personally, I
– go to Frequentist Church on Sunday
– shop at Bayes’ the rest of the week

Making accurate predictions
Be careful not to overfit the training data.
– Way to avoid this: use a prior distribution on the coefficients.
– Two types: Gaussian and Laplace.

If you use a Gaussian prior
For a Gaussian prior (ridge): every feature enters the model.

If you use a Laplace prior
For a Laplace prior (lasso): features are added slowly, and require stronger evidence.
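
In symbols (a standard formulation, filled in here for clarity rather than copied from the slides), the MAP estimate under each prior is:

```latex
\hat{\boldsymbol\beta}
  = \arg\max_{\boldsymbol\beta}
    \sum_{i} \log p\!\left(y_i \mid \mathbf{x}_i, \boldsymbol\beta\right)
    + \log p(\boldsymbol\beta),
\qquad
\log p(\boldsymbol\beta) =
\begin{cases}
  -\lambda \lVert \boldsymbol\beta \rVert_2^2 + \text{const.} & \text{Gaussian prior (ridge)}\\
  -\lambda \lVert \boldsymbol\beta \rVert_1 + \text{const.} & \text{Laplace prior (lasso)}
\end{cases}
```

The non-differentiable L1 term is what keeps most coefficients exactly at zero until the data supply strong evidence, which is the behavior the slide describes.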

Success Qualifier
AUC = the average distance, from the bottom of the list, of the items we’d like to see at the top.
Null hypothesis: distributed as the average of P uniform variates.
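
A small sketch of this rank-based view of AUC, using numpy and scikit-learn (our own illustration; the original evaluation code is not shown in the slides):

```python
# AUC as the normalized average rank, from the bottom of the list, of the
# items we want at the top. Under the null hypothesis (random scores), each
# positive item's normalized rank is uniform, so AUC behaves like the
# average of P uniform variates (mean 0.5).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([1, 0, 1, 0, 0, 1, 0, 0])   # 1 = item we'd like near the top
scores = rng.random(len(y))               # random scores = null hypothesis

ranks = scores.argsort().argsort() + 1    # rank 1 = bottom of the list
P, N = y.sum(), len(y) - y.sum()
auc_from_ranks = (ranks[y == 1].sum() - P * (P + 1) / 2) / (P * N)

assert np.isclose(auc_from_ranks, roc_auc_score(y, scores))
```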

Test Corpus
We took ten authors who were prolific between 1997 and 2002, and whose papers were easy to disambiguate manually so that we could check the results of BBR. We then chose six kinds of features from these authors’ work for training and testing BBR, in the hope that one kind of feature might prove more useful for identifying authors than another:
– Keywords
– Co-author names
– Addresses (words)
– Abstract
– Addresses (n-grams)
– Title

Domain Knowledge and Optimization in Bayesian Logistic Regression
Thanks to Dave Lewis (David D. Lewis Consulting, LLC)

Outline
– Bayesian logistic regression
– Advances:
  – Using domain knowledge to reduce the need for training data
  – Speeding up training and classification
  – Online training

Logistic Regression in Text and Data Mining
Classification as a fundamental primitive:
– Text categorization: classes = content distinctions
– Filtering: classes = user interests
– Entity resolution: classes = entities
Bayesian logistic regression:
– A probabilistic classifier allows combining outputs
– A prior allows creating sparse models and combining training data with domain knowledge (sketch below)
– Our KDD-funded BBR and BMR software is now widely used
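
A minimal sketch of the sparsity effect using scikit-learn (our own illustration; the project’s actual software is BBR/BMR):

```python
# Gaussian prior ~ L2 (ridge) penalty: every feature gets a nonzero weight.
# Laplace prior ~ L1 (lasso) penalty: most weights stay exactly zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

ridge = LogisticRegression(penalty="l2").fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

print("nonzero weights, ridge:", int(np.sum(ridge.coef_ != 0)))  # typically all 50
print("nonzero weights, lasso:", int(np.sum(lasso.coef_ != 0)))  # typically far fewer
```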

A Lasso Logistic Model (category “grain”)

Using Expert Knowledge in Text Classification
What do we know about a category?
– Category description (e.g., MeSH: Medical Subject Headings)
– Human knowledge of good predictor words
– Reference materials (e.g., the CIA World Factbook)
All give clues to good predictor words for a category.
– We convert these into a prior on parameter values for words (sketched below).
– Other classification tasks, e.g. entity resolution, have expert knowledge too.
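
A hypothetical sketch of the conversion step (the helper name and values are invented; BBR’s actual prior-file format is not reproduced here): words gleaned from a category description or reference source get a prior that either lets their coefficients move more freely (larger variance) or starts them at a nonzero mode.

```python
DEFAULT_VARIANCE = 0.01  # tight prior: unknown words need strong evidence

def build_prior(vocabulary, knowledge_words,
                knowledge_mode=0.0, knowledge_variance=1.0):
    """Map each feature to a (mode, variance) pair for its coefficient prior."""
    return {word: ((knowledge_mode, knowledge_variance)
                   if word in knowledge_words
                   else (0.0, DEFAULT_VARIANCE))
            for word in vocabulary}

# e.g. for the Reuters Region category "Argentina", drawing clue words from
# the CIA World Factbook entry:
factbook_words = {"buenos", "aires", "peso", "patagonia", "pampas"}
prior = build_prior(["wheat", "peso", "election"], factbook_words)
print(prior)  # {'wheat': (0.0, 0.01), 'peso': (0.0, 1.0), 'election': (0.0, 0.01)}
```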

Constructing Informative Prior Distributions from Domain Knowledge in Text Classification
Aynur Dayanik, David D. Lewis, David Madigan, Vladimir Menkov, and Alexander Genkin, January 2006

Corpora
– TREC Genomics: presence or absence of certain MeSH headings
– ModApte “top 10 categories” (Wu and Srihari)
– RCV1 A-B (see next slide)

Categories. We selected a subset of the Reuters Region categories whose names exactly matched the names of geographical regions with entries in the CIA World Factbook (see below), and which had one or more positive examples in our large (23,149-document) training set. There were 189 such matches, from which we chose the 27 with names beginning with the letter A or B, reserving the rest for future use.

Some experimental results
Documents are represented by the so-called tf.idf representation. Prior information can be used either to change the prior variance or to set the prior mode (an offset or bias). Those results are shown in red; lasso with no prior information is shown in black.
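
For concreteness, a minimal tf.idf sketch with scikit-learn (an assumption; any tf.idf implementation yields the same kind of representation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["wheat exports from argentina rose",
        "argentina wheat crop estimates revised"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```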

Large Training Sets (3700 to … examples)
[Chart: results for Ridge and Lasso, with the prior used to set the variance (Var/TFIDF) or the mode (Mode/TFIDF), on the RCV1-v2, ModApte, and Bio Articles corpora.]

Tiny Training Sets (5 positive examples + 5 random examples)
[Chart: results for Ridge and Lasso, with the prior used to set the variance (Var/TFIDF) or the mode (Mode/TFIDF), on the RCV1-v2, ModApte, and Bio Articles corpora.]

Findings
– With lots of training data, adding domain knowledge doesn’t help much.
– With little training data, it helps more, and more often.
– It is more effective when used to set the priors than when used as “additional training data”.

Speeding Up Classification
Completed new version of BMRclassify (replaces the old BBRclassify and BMRclassify).
– More flexible: can apply thousands of binary and polytomous classifiers simultaneously.
– Allows meaningful names for features.
– Uses an inverted index into the classifier suites for speed: a 25x speedup over the old BMRclassify and BBRclassify. A sketch of the idea follows.
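
A sketch of the inverted-index idea (our own illustration, not the BMRclassify code): index the classifier suite by feature, so that scoring a sparse document touches only the classifiers that actually use its features.

```python
from collections import defaultdict

def invert(classifiers):
    """classifiers: {classifier_id: {feature: weight}} -> per-feature postings."""
    postings = defaultdict(list)
    for cid, weights in classifiers.items():
        for feature, w in weights.items():
            postings[feature].append((cid, w))
    return postings

def score(document, postings, intercepts):
    """document: {feature: value}. Returns a linear score per classifier."""
    scores = dict(intercepts)                     # start from each intercept
    for feature, value in document.items():
        for cid, w in postings.get(feature, ()):  # only classifiers using it
            scores[cid] += w * value
    return scores

classifiers = {"grain": {"wheat": 2.1, "crop": 0.7},
               "metals": {"copper": 1.8}}
postings = invert(classifiers)
print(score({"wheat": 1.0, "price": 1.0}, postings,
            {"grain": -1.0, "metals": -1.0}))     # {'grain': 1.1, 'metals': -1.0}
```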

Online & Reduced-Memory Training
– Rapid updating of classifiers as new data arrive.
– Can use training sets too big to fit in memory; larger training sets, when available, give higher accuracy.
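
A minimal sketch of online training (plain stochastic gradient ascent on the logistic likelihood, shown for orientation; not the authors’ specific algorithm): the weights are updated one example at a time, so the stream never has to fit in memory.

```python
import math

def sgd_update(weights, x, y, lr=0.1):
    """x: {feature: value}; y in {0, 1}. One online update, in place."""
    z = sum(weights.get(f, 0.0) * v for f, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))               # predicted P(y = 1)
    for f, v in x.items():
        weights[f] = weights.get(f, 0.0) + lr * (y - p) * v

weights = {}
stream = [({"wheat": 1.0}, 1), ({"copper": 1.0}, 0),
          ({"wheat": 1.0, "crop": 1.0}, 1)]      # could equally be read from disk
for x, y in stream:
    sgd_update(weights, x, y)
print(weights)
```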

KDD Entity Resolution Challenges
– ER1b: Is this pair of person names, possibly altered, really the same person?
– ER2a: Which of the persons in this author list is using a pseudonym?

KDD Entity Resolution Challenges
– ER1b: Is this pair of person names, possibly altered, really the same person? (NO)
  Smith and Jones
  Smith Jones and Wesson
– ER2a: Which of the persons in this author list is using a pseudonym? (YES: this one!)

Conclusions drawn from KDD-1
On ER1b, several of our submissions topped the rankings based on accuracy (the only measure used). Best: dimacs-er1b-modelavg-tdf-aaan
– Probabilities for all document pairs from 11 CLUTO methods and 1 document-similarity model, summed; the vectors included some author address (AAAN) information.

And…
– dimacs-er1b-modelavg-tdf-noaaan was second: no AAAN info in the vectors.
– Third: dimacs-er1b-modelavg-binary-noaaan, combining many models with a binary representation and no AAAN info.
– CONCLUSION: model averaging (alias: data fusion, combination) is better than any of the parts.

Conclusions on ER2a
On ER2a the measures were accuracy, squared error, ROC area (AUC), and cross entropy.
– Our run dimacs-er2a-single-X placed 3rd by accuracy and 4th by AUC.
– We trained binary logistic regression, using binary vectors (with AAAN info).
– The probability of “no replacement” = the product of the conditional probabilities for the individual authors (illustrated below). Some post-processing used AAAN info.
– No information from the text itself.
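
A tiny illustration of the combination rule in the third bullet: if the classifier gives each listed author a probability of being genuine, the probability of “no replacement” is their product, assuming independence (numbers invented):

```python
from math import prod   # Python 3.8+

p_genuine = [0.95, 0.90, 0.98]      # per-author P(genuine) from the classifier
p_no_replacement = prod(p_genuine)
print(round(p_no_replacement, 3))   # 0.838
```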

And…
– Omitting the AAAN post-processing (dimacs-er2a-single) did somewhat worse: 4th by accuracy and 6th by AUC.
– Every kind of information helps, even if it is there by accident.

ER2: Which authors belong? The affinity of authors to each other (naïve)

ER2: Which authors belong? A more sophisticated model
Update the probability that y is an author of D, yielding, after some work, the formula: [not captured in this transcript]
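
The transcript does not capture the slide’s formula. For orientation only, the generic shape of such an update is Bayes’ rule applied to the authorship event, with E the available evidence (e.g., the rest of the author list):

```latex
P\bigl(y \in \mathrm{authors}(D) \mid E\bigr)
  = \frac{P\bigl(E \mid y \in \mathrm{authors}(D)\bigr)\,
          P\bigl(y \in \mathrm{authors}(D)\bigr)}
         {P(E)}
```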

Read more at:
Simulated Entity Resolution by Diverse Means: DIMACS Work on the KDD Challenge of 2005. Andrei Anghelescu, Aynur Dayanik, Dmitriy Fradkin, Alex Genkin, Paul Kantor, David Lewis, David Madigan, Ilya Muchnik and Fred Roberts, December 2005, DIMACS Technical Report.

Read more at ERS/tc-dk.pdf

Streaming algorithms
L1 (lasso) streaming algorithms give good performance. See: Algorithms for Sparse Linear Classifiers in the Massive Data Setting. S. Balakrishnan and D. Madigan. Journal of Machine Learning Research, submitted, 2006.

Summary of findings
– With little training data, Bayesian methods work better when they can use general knowledge about the target group.
– To determine whether several “records” refer to the same “person”, there is no “magic bullet”; combining many methods is feasible and powerful.
– To detect an imposter in a group, “social” methods based on combining probabilities are effective (we did not use information from the papers themselves).