Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Doerre, Peter Gerstl, Roland Seiffert (IBM Germany, August 1999)
Presenter: Tyler Carr, April 22, 2004

Presentation transcript:

Slide 1: Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Doerre, Peter Gerstl, Roland Seiffert, IBM Germany, August 1999
Presenter: Tyler Carr, April 22, 2004

Slide 2: Outline (current section: Motivation)
- Motivation
- Methodology
- Feature Extraction
- Clustering and Categorizing
- Applications
- Exam Questions

Slide 3 (Motivation)
90% of a company’s data cannot be analyzed with standard data mining:
- Customer letters
- Correspondence
- Phone call recordings
- Contracts
- Technical documentation
- Patents
- News articles
- Web pages

Slide 4: Value of Text Mining
- Rapid digestion of large document collections
- Faster than human knowledge brokers
- Objective and customizable analysis
- Automation of tasks

Slide 5: Typical Applications
- Summarizing documents
- Monitoring relations among people, places, and organizations
- Organizing documents by content
- Organizing indices for search and retrieval (keyword finding)
- Retrieving documents by content

Slide 6: Outline (current section: Methodology)

Slide 7: Challenges in Text Mining
- Information is in unstructured textual form
- Natural language (NL) interpretation is years away for computers
- Text mining deals with huge collections of documents

Slide 8: Two Text Mining Approaches
- Knowledge Discovery
  - Extraction of codified information (features)
- Information Distillation
  - Analysis of the feature distribution

Slide 9: Comparison with Data Mining
Data Mining:
1. Identify data sets
2. Select features manually
3. Prepare data
4. Analyze distribution
Text Mining:
1. Identify documents
2. Extract features
3. Select features by algorithm
4. Prepare data
5. Analyze distribution
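The slide does not say which algorithm selects the features. As one plausible illustration only, the sketch below filters features by document frequency: it keeps a feature if it occurs in at least `min_docs` documents but in no more than `max_fraction` of the collection. The function name, both parameters, and the sample data are assumptions made for this example.

```python
from collections import Counter


def select_features(docs, min_docs=2, max_fraction=0.8):
    """Keep features that are neither too rare nor too common.

    docs is a list of feature lists, one per document. This
    document-frequency filter is an assumed stand-in for the unnamed
    "select features by algorithm" step.
    """
    # Count in how many documents each feature occurs (not how often).
    doc_freq = Counter(f for doc in docs for f in set(doc))
    limit = max_fraction * len(docs)
    return {f for f, n in doc_freq.items() if min_docs <= n <= limit}


docs = [
    ["printer", "jam", "the"],
    ["printer", "paper", "the"],
    ["refund", "the"],
    ["refund", "late", "the"],
    ["the", "misc"],
]
selected = select_features(docs)
```

Here "the" is dropped because it appears in every document, and the one-off features are dropped because they appear only once.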

Slide 10: Outline (current section: Feature Extraction)

Slide 11: Feature Extraction
“To recognize and classify significant vocabulary items in unrestricted natural language texts.”
Classes of vocabulary:
- Proper names
- Technical phrases
- Abbreviations and acronyms
- …

Slide 12: Canonical Forms
- Numbers are converted to a normal form: Four ==> 4
- Dates are converted to a normal form
- Inflected forms are converted to a common form: Sings, Sang, Sung ==> Sing
- Alternative names are converted to an explicit form: Mr. Carr, Tyler, Presenter ==> Tyler Carr
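The canonical-form mappings above can be sketched with small lookup tables. This is only an illustration: the tables, the function names, and the choice of ISO 8601 as the date "normal form" are assumptions, and a real system would rely on much larger lexicons and morphological rules. The date parsing assumes English month names.

```python
from datetime import datetime

# Hypothetical, tiny lookup tables built from the slide's own examples.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
ALIASES = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr",
           "presenter": "Tyler Carr"}


def canonical_form(term):
    """Return the canonical form of a term, or the term itself if unknown."""
    key = term.strip().lower()
    for table in (NUMBER_WORDS, LEMMAS, ALIASES):
        if key in table:
            return table[key]
    return term.strip()


def normalize_date(text):
    """Convert a date such as 'April 22, 2004' to ISO form, else None."""
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None
```

For example, `canonical_form("Sang")` yields `"sing"`, and `normalize_date("April 22, 2004")` yields `"2004-04-22"`.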

Slide 13: Feature Extraction Tools
- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information
  - Part-of-speech information (subject, verb)
- Avoid analyzing too deep (for speed)
  - Does not use huge amounts of lexical information
  - No in-depth syntactic and semantic analysis

Slide 14: Feature Extraction Example
Disambiguating proper names (Nominator program):
- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The heuristics represent English naming conventions.
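The following sketch illustrates the idea of string-level heuristics applied within one document. It is not the Nominator algorithm, whose details the slides do not give. It assumes the name candidates have already been found, strips a few English titles, and links a shorter variant to a longer name only when exactly one longer name in the same document contains all of its tokens. All names in the code are hypothetical.

```python
TITLES = {"mr.", "mrs.", "ms.", "dr."}  # illustrative, not exhaustive


def name_tokens(name):
    """Split a name into tokens, dropping titles such as 'Mr.'."""
    return tuple(tok for tok in name.split() if tok.lower() not in TITLES)


def resolve_name_variants(candidates):
    """Map each candidate string from one document to a fuller name.

    A candidate is resolved only if exactly one multi-token name in the
    document contains all of its tokens; ambiguous or unmatched
    candidates are left unchanged.
    """
    tokens_by_candidate = {c: name_tokens(c) for c in candidates}
    full_names = {t for t in tokens_by_candidate.values() if len(t) > 1}
    resolved = {}
    for candidate, tokens in tokens_by_candidate.items():
        matches = [full for full in full_names
                   if tokens and set(tokens) <= set(full)]
        resolved[candidate] = (" ".join(matches[0])
                               if len(matches) == 1 else candidate)
    return resolved


names = resolve_name_variants(["Tyler Carr", "Mr. Carr", "Tyler", "Peter Gerstl"])
```

Because the document is the unit of context, "Mr. Carr" resolves to "Tyler Carr" only when that full name occurs in the same document. With both "John Smith" and "Anna Smith" present, a bare "Smith" stays unresolved.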

Slide 15: Feature Extraction Goals
- Very fast processing to deal with huge amounts of data
- Domain independence for general applicability

Slide 16: Outline (current section: Clustering and Categorizing)

Slide 17: Clustering
- Also called Knowledge Discovery
- Fully automatic process
- Partitions a given collection into groups of documents similar in content
- Clusters are identifiable by feature vectors
- Provides a set of keywords for each cluster
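As a rough illustration of clustering by feature vectors, the sketch below groups documents with a simple single-pass scheme: each document joins the most similar existing cluster if the cosine similarity to that cluster's summed feature vector reaches a threshold, and otherwise starts a new cluster. The slides do not describe the actual algorithm, so the single-pass strategy, the threshold value, and all names here are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_documents(docs, threshold=0.3):
    """Partition docs (lists of features) into clusters in one pass."""
    clusters = []  # each cluster: {"members": [indices], "centroid": Counter}
    for index, features in enumerate(docs):
        vec = Counter(features)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]),
                   default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vec)  # add this document's counts
        else:
            clusters.append({"members": [index], "centroid": Counter(vec)})
    return clusters


def cluster_keywords(cluster, k=2):
    """Return the k most frequent features as the cluster's keywords."""
    return [f for f, _ in cluster["centroid"].most_common(k)]


docs = [
    ["printer", "jam", "paper"],
    ["printer", "paper", "tray"],
    ["invoice", "billing", "refund"],
    ["refund", "billing", "late"],
]
clusters = cluster_documents(docs)
```

On this sample the two printer documents form one cluster and the two billing documents another, with keywords drawn from each cluster's most frequent features.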

Slide 18: Two Clustering Engines
- Hierarchical Clustering tool
  - Orders the clusters into a tree reflecting various levels of similarity.
- Binary Relational Clustering tool
  - Produces a flat clustering together with relationships of different strength between the clusters
  - Relationships reflect inter-cluster similarities
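To make the idea of a cluster tree concrete, here is a minimal agglomerative sketch: it repeatedly merges the two most similar groups until one tree remains. It is not the Hierarchical Clustering tool itself. The Jaccard measure, the single-link rule (similarity of the closest pair of documents), and all function names are assumptions chosen for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaves(tree):
    """List the document indices contained in a (possibly nested) tree."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def build_cluster_tree(docs):
    """Merge the two most similar groups until a single tree remains.

    docs is a non-empty list of feature lists. The result is a nested
    tuple of document indices, e.g. ((0, 1), (2, 3)).
    """
    if not docs:
        raise ValueError("need at least one document")
    feature_sets = [set(d) for d in docs]

    def similarity(t1, t2):
        # Single link: similarity of the closest pair of documents.
        return max(jaccard(feature_sets[i], feature_sets[j])
                   for i in leaves(t1) for j in leaves(t2))

    trees = list(range(len(docs)))
    while len(trees) > 1:
        pairs = [(similarity(trees[i], trees[j]), i, j)
                 for i in range(len(trees))
                 for j in range(i + 1, len(trees))]
        _, i, j = max(pairs, key=lambda p: p[0])
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]


tree = build_cluster_tree([
    ["printer", "jam", "paper"],
    ["printer", "paper", "tray"],
    ["invoice", "billing", "refund"],
    ["refund", "billing", "late"],
])
```

Deeper levels of the resulting tuple correspond to tighter similarity. The top-level split here separates the printer documents from the billing documents.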

Slide 19: Clustering Model [figure]

Slide 20: Categorization
- Also called Information Distillation
- Topic Categorization Tool
  - Assigns documents to pre-existing categories (“topics” or “themes”)
  - Categories are chosen to match the intended use of the collection

Slide 21: Categorization
- Categories are defined by providing a set of sample documents for each category
- The training phase produces a special index, called the categorization schema
- The categorization tool returns a set of category names and confidence levels for each document

Slide 22: Categorization
- If confidence is below some threshold, the document is set aside for a human categorizer
- Tests have shown the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.
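The categorization workflow described on the last few slides (train from sample documents, return category names with confidence levels, route low-confidence documents to a person) can be sketched as follows. This is not the Topic Categorization Tool: the "schema" here is simply a summed feature vector per category, confidence is cosine similarity, and the threshold value and all names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(samples):
    """Build a stand-in schema: one summed feature vector per category.

    samples maps a category name to a list of sample documents, each
    given as a list of features.
    """
    schema = {}
    for category, docs in samples.items():
        prototype = Counter()
        for features in docs:
            prototype.update(features)
        schema[category] = prototype
    return schema


def categorize(features, schema, threshold=0.5):
    """Rank categories by confidence and flag low-confidence documents."""
    vec = Counter(features)
    ranked = sorted(((cosine(vec, proto), cat) for cat, proto in schema.items()),
                    reverse=True)
    needs_review = not ranked or ranked[0][0] < threshold
    return {"ranked": [(cat, score) for score, cat in ranked],
            "needs_human_review": needs_review}


schema = train({
    "hardware": [["printer", "jam"], ["printer", "paper"]],
    "billing": [["invoice", "refund"], ["refund", "late"]],
})
confident = categorize(["printer", "paper", "jam"], schema)
uncertain = categorize(["warranty", "question"], schema)
```

The first document ranks "hardware" highest with high confidence. The second shares no features with any category, so it is flagged for a human categorizer.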

Slide 23: Categorization Model [figure]

Slide 24: Outline (current section: Applications)

Slide 25: IBM Intelligent Miner for Text
- Software Development Kit (not a full application)
- Contains the necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - Drop-in intranet search solutions

Slide 26: Applications
- Customer Relationship Intelligence (CRI): a Customer Relationship Management application provided by IBM Intelligent Miner for Text
- “Help companies better understand what their customers want and what they think about the company itself.”

Slide 27: Customer Intelligence Process
1. Take the body of communications with customers as input.
2. Cluster the documents to identify issues.
3. Characterize the clusters to identify the conditions for problems.
4. Assign new messages to the appropriate clusters.

Slide 28: Customer Intelligence Usage
- Knowledge Discovery
  - Clustering is used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of clustering results:
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters

Slide 29: Outline (current section: Exam Questions)

Slide 30: Exam Question #1
Question: Name an example of each of the two main classes of applications of text mining.
Answer:
- Knowledge Discovery: discovering a common customer complaint among much feedback.
- Information Distillation: filtering future comments into pre-defined categories.

Slide 31: Exam Question #2
Question: How does the procedure for text mining differ from the procedure for data mining?
Answer:
- Adds a feature extraction function
- Not feasible to have humans select features
- Highly dimensional, sparsely populated feature vectors
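A small sketch shows why text feature vectors are high-dimensional and sparse: the dimension equals the number of distinct features in the whole collection, while each document contains only a few of them. Storing each vector as a mapping of nonzero counts is one common way to handle this. The function names and sample data are assumptions made for illustration.

```python
from collections import Counter


def sparse_vectors(docs):
    """Return the collection vocabulary and one sparse vector per document."""
    vocabulary = sorted({f for doc in docs for f in doc})
    vectors = [Counter(doc) for doc in docs]  # only nonzero entries stored
    return vocabulary, vectors


def density(vocabulary, vectors):
    """Fraction of nonzero cells in the full document-by-feature matrix."""
    cells = len(vocabulary) * len(vectors)
    nonzero = sum(len(v) for v in vectors)
    return nonzero / cells if cells else 0.0


vocabulary, vectors = sparse_vectors([
    ["printer", "jam", "paper"],
    ["printer", "paper", "tray"],
    ["invoice", "billing", "refund"],
    ["refund", "billing", "late"],
])
fill = density(vocabulary, vectors)
```

Even in this toy collection of four documents and eight distinct features, only 12 of the 32 cells are nonzero. In a real collection the vocabulary grows much faster than the number of features per document, so the density falls far lower.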

Slide 32: Exam Question #3
Question: In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
Answer: It does not perform in-depth syntactic or semantic analysis of texts.

Slide 33: Thank You
Any questions?
