MONITORING MESSAGE STREAMS: RETROSPECTIVE AND PROSPECTIVE EVENT DETECTION

Presentation transcript:

MONITORING MESSAGE STREAMS: RETROSPECTIVE AND PROSPECTIVE EVENT DETECTION
Rutgers/DIMACS improves on existing methods for monitoring huge streams of textualized communications to automatically identify clusters of messages relating to significant "events."
Kantor © 2002, May 8, 2002

Research Components
(1) Compression of text to meet storage and processing limitations;
(2) representation of text in a form amenable to computation and statistical analysis;
(3) a matching scheme for computing similarity between documents in terms of the representation chosen;
(4) a learning method that builds on a set of judged examples to determine the key characteristics of a document cluster or "event"; and
(5) a fusion scheme that combines methods that are "sufficiently different" to yield improved detection and clustering of documents.
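To make the five components concrete, here is a minimal Python sketch of how they might compose on a toy corpus. It is an illustration only: the TF-IDF/SVD representation, logistic-regression learner, and simple weighted fusion are stand-ins chosen for brevity, not the project's actual methods, and all names, data, and parameters are hypothetical.

```python
# Illustrative sketch of the five-component pipeline; everything here is a
# placeholder, not code from the project's deliverables.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression

docs = ["first message text", "second message text", "third message"]
labels = [1, 0, 1]  # judged examples: 1 = belongs to the event cluster

# (1)-(2) compression + representation: TF-IDF followed by dimension reduction
X = TfidfVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)  # tiny value for this toy corpus

# (3) matching: similarity between documents in the reduced space
S = cosine_similarity(X_reduced)

# (4) learning: a classifier trained on the judged examples
clf = LogisticRegression().fit(X_reduced, labels)

# (5) fusion: combine the classifier score with a nearest-neighbor similarity score
def fused_score(i, weight=0.5):
    clf_score = clf.predict_proba(X_reduced[i:i + 1])[0, 1]
    nn_score = np.max(np.delete(S[i], i))  # similarity to the closest other document
    return weight * clf_score + (1 - weight) * nn_score

print([round(fused_score(i), 3) for i in range(len(docs))])
```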

Approach/objectives
* Sophisticated dimension-reduction methods in a preprocessing stage.
* Sophisticated statistical tools in later stages.
* Goal: identify the best combination of such newer methods through a careful exploration of a variety of tools.
* Efficiency (in computational time and space).
* A combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.
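As one example of a dimension-reduction preprocessor of the kind referred to here, the sketch below applies a sparse random projection (random projection is listed later in these slides among the methods developed in the project) to a TF-IDF term matrix. The corpus and target dimensionality are placeholders.

```python
# Sketch of dimension reduction as a preprocessing stage; the corpus and the
# target dimensionality are illustrative, not project settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection

corpus = ["message one", "message two", "message three"]
X = TfidfVectorizer().fit_transform(corpus)           # high-dimensional term space

projector = SparseRandomProjection(n_components=2, random_state=0)
X_low = projector.fit_transform(X)                    # compressed representation
print(X.shape, "->", X_low.shape)
```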

Approach/objectives (2)
* "Semi-supervised" learning: human analysts help to focus on the features most indicative of change or anomaly; algorithms assess whether incoming documents deviate "significantly" on those features.
* New techniques are needed to represent the data so that significant deviation ("abnormality") can be flagged through an appropriately defined metric.
* New clustering algorithms build on the analyst-designated features.
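The following sketch illustrates, under assumed inputs, how deviation on analyst-designated features might be flagged: a z-score is computed against a baseline window, restricted to the chosen feature indices, and incoming messages exceeding an illustrative threshold are flagged. The feature indices, baseline window, and threshold are hypothetical, not values from the project.

```python
# Sketch of flagging "significant deviation" on analyst-designated features.
# All data, feature indices, and the threshold are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(size=(200, 10))        # feature vectors for past messages
incoming = rng.normal(size=(5, 10))          # new messages to screen
incoming[2, [1, 4]] += 6.0                   # one message deviates strongly

analyst_features = [1, 4, 7]                 # indices designated by the analyst

mu = baseline[:, analyst_features].mean(axis=0)
sigma = baseline[:, analyst_features].std(axis=0) + 1e-9

# z-score distance restricted to the designated features
z = np.abs((incoming[:, analyst_features] - mu) / sigma)
deviation = z.max(axis=1)                    # worst-case deviation per message

threshold = 4.0                              # illustrative cutoff
flags = deviation > threshold
print(list(zip(np.round(deviation, 2), flags)))
```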

Data Sets
* TREC data: 5 CDs; some subsets time-stamped; scores available for filtering and routing tasks (10^5-10^6 messages).
* Reuters Volume 1: 8x10^5 messages.
* Google (Usenet set): potentially 10^7 messages.
* MEDLINE: 10^7 documents.

Judges/Experts
* Existing collections often contain judgments regarding relevance to a query (TREC) or classification information that can be treated as surrogate judgments (Reuters; MEDLINE).
* We would benefit greatly from continuous interaction with real analysts, to understand the types of judgments and classifications that are salient to them.
* Some overlap with the Strzalkowski/Kantor work for AQUAINT (the HITIQA project).

Work Phase Ia
* Prepare available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Systematically explore combinations of methods for the supervised learning task (choosing promising combinations of compression, representation method, matching scheme, learning scheme, and fusion method).
* Hold a related workshop at IDA-CCR in Princeton (paid for by IDA-CCR). (Boros, Kantor, Lewis, Madigan, Muchnik)
* Test combinations of methods on common data sets, starting with the smaller ones; exchange information among researchers developing/testing different combinations of methods. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop promising compression/dimension-reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss)

Work Phase Ib
* Refine available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Extend the systematic exploration of combinations of methods for the supervised learning task. Establish the limits of these technologies, including rates of convergence and probabilities of success. Code combined methods for experimental purposes. (Boros, Kantor, Lewis, Madigan, Muchnik)
* Test combinations of methods on common data sets, working up to the larger ones; exchange information. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop and test promising compression/dimension-reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss)

Specific Responsibilities
We can best discuss our individual responsibilities with reference to a cross matrix of activities by areas of focus.
* Activities (aspect of the project): Algorithms, Code, Evaluation, Dissemination.
* Focus (specific research topic): Compression, Representation, Matching, Learning, Fusion.

Allocation of Responsibility
FOCUS \ ASPECT     Algorithm    Code    Evaluate    Dissemination
Compression
Representation
Matching
Learning
Fusion             PBK (Kantor; detailed on the next slide)

Kantor (details)
Fusion:
* Algorithms: linear; quadratic; non-parametric.
* Code: Perl/awk low-speed "hacker" code.
* Evaluation: TREC scores; Reuters classifications, etc.
* Dissemination: papers; web site; source codes (low documentation).
Message entities {m} are treated with various representation methods, m -> R_m; inter-entity similarity S_{m,m'} is computed using various methods as S(R_m, R_m'). Fusion combines methods to form metamethods.
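A small sketch of this notation follows, with two stand-in representation methods (TF-IDF and raw counts) and cosine similarity as the matching scheme; the message texts and method choices are purely illustrative.

```python
# Sketch of the notation: representation methods map a message m to R_m, and a
# similarity method compares two representations, S(R_m, R_m').
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

messages = ["troops moved toward the border region",
            "reports of troops massing near the border"]

# Two representation methods, m -> R_m
R_tfidf = TfidfVectorizer().fit(messages)
R_count = CountVectorizer().fit(messages)

def represent(method, m):
    return method.transform([m]).toarray()

# A similarity method, S(R_m, R_m')
def S(r1, r2):
    return cosine_similarity(r1, r2)[0, 0]

m, m2 = messages
print("tfidf :", S(represent(R_tfidf, m), represent(R_tfidf, m2)))
print("counts:", S(represent(R_count, m), represent(R_count, m2)))
```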

Fusion
* At the representation level: R, R', R'' -> R*, which combines them. Example: direct sum of vector spaces.
* At the similarity level: S, S', S'' -> S*, which produces a composite similarity score. Examples: weighted sums; maximum, minimum, non-linear forms.
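The two fusion levels can be illustrated in a few lines; the vectors, scores, and weights below are made up, and the direct sum is realized simply as vector concatenation.

```python
# Sketch of the two fusion levels described above; all inputs are illustrative.
import numpy as np

# Representation-level fusion: R, R' -> R* via direct sum (concatenation)
R1 = np.array([0.2, 0.7, 0.1])      # e.g. a topic-style representation
R2 = np.array([1.0, 0.0])           # e.g. a metadata representation
R_star = np.concatenate([R1, R2])   # lives in the direct-sum space

# Similarity-level fusion: S, S' -> S* via weighted sum, maximum, or minimum
S1, S2 = 0.62, 0.35                 # scores from two matching schemes
weighted = 0.7 * S1 + 0.3 * S2
fused_max, fused_min = max(S1, S2), min(S1, S2)
print(R_star, weighted, fused_max, fused_min)
```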

The "Fusion Program"
Given a space of representation methods {R} and of similarity methods {S}, answer these questions:
* Which methods can be combined to give results better than the best single method?
* What forms of combination (fusion), for those methods, produce the best results?
* When is fusion called for, and when should it be avoided?
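One way to pose the first question empirically is sketched below: compare the best single method against the best linear fusion on judged data, using average precision as the effectiveness measure. The scores and judgments are simulated, and a real study would tune fusion weights on held-out data rather than in-sample.

```python
# Sketch of the empirical question: does a fused score beat the best single
# method on judged data? Scores and judgments here are simulated.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
relevance = rng.integers(0, 2, size=500)                 # surrogate judgments
s1 = 0.6 * relevance + rng.normal(0, 0.4, size=500)      # method 1 scores
s2 = 0.5 * relevance + rng.normal(0, 0.4, size=500)      # method 2 scores

best_single = max(average_precision_score(relevance, s1),
                  average_precision_score(relevance, s2))

# Try a small grid of linear fusion weights (in-sample, for illustration only)
fused_best = max(
    average_precision_score(relevance, w * s1 + (1 - w) * s2)
    for w in np.linspace(0, 1, 11)
)
print(f"best single: {best_single:.3f}  best fused: {fused_best:.3f}")
```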

Fusion: method
* Empirical exploration of the space of possibilities, guided by applicable principles from statistics, either directly or via machine learning.
* Evaluation using the existing sets of classified or judged message texts.

Fusion: Deliverables
* CONCEPTUAL: systematic tabulation of the effectiveness of various fusion approaches, applied to the set of representations and similarity schemes (a) commonly available [the LEMUR toolkit] and (b) developed in this project [LAD, random projection, ...].
* USABLE: simple codes for performing various kinds of fusion, both ad hoc and adaptively.
* DISSEMINATION: reports, papers, web site.
END KANTOR