
Slide 1: Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Doerre, Peter Gerstl, Roland Seiffert
IBM Germany, August 1999
Presenter: Tyler Carr
April 22, 2004

Slide 2: Outline
- Motivation
- Methodology
- Feature Extraction
- Clustering and Categorizing
- Applications
- Exam Questions

Slide 3: 90% of a company's data cannot be looked at with standard data mining:
- Customer letters
- E-mail correspondence
- Phone call recordings
- Contracts
- Technical documentation
- Patents
- News articles
- Web pages

Slide 4: Value of Text Mining
- Rapid digestion of large document collections
- Faster than human knowledge brokers
- Objective and customizable analysis
- Automation of tasks

Slide 5: Typical Applications
- Summarizing documents
- Monitoring relations among people, places, and organizations
- Organizing documents by content
- Organizing indices for search and retrieval (keyword finding)
- Retrieving documents by content

Slide 6: Outline (current section: Methodology)

Slide 7: Challenges in Text Mining
- Information is in unstructured textual form
- Natural language (NL) interpretation is years away for computers
- Text mining deals with huge collections of documents

Slide 8: Two Text Mining Approaches
- Knowledge Discovery
  - Extraction of codified information (features)
- Information Distillation
  - Analysis of the feature distribution

Slide 9: Comparison with Data Mining
Data Mining:
- Identify data sets
- Select features manually
- Prepare data
- Analyze distribution
Text Mining:
- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution

Slide 10: Outline (current section: Feature Extraction)

Slide 11: Feature Extraction
“To recognize and classify significant vocabulary items in unrestricted natural language texts.”
Classes of vocabulary:
- Proper names
- Technical phrases
- Abbreviations and acronyms
- …

Slide 12: Canonical Forms
- Numbers are converted to a normal form: Four ==> 4
- Dates are converted to a normal form
- Inflected forms are converted to a common form: Sings, Sang, Sung ==> Sing
- Alternative names are converted to an explicit form: Mr. Carr, Tyler, Presenter ==> Tyler Carr
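The canonical-form mappings on this slide can be sketched as simple table lookups. This is only an illustration: the tables `NUMBER_WORDS`, `LEMMAS`, `ALIASES`, the accepted `DATE_FORMATS`, and the function `canonical_form` are hypothetical names, not part of the system described in the talk, and ISO 8601 is an assumed choice of normal form for dates.

```python
from datetime import datetime

# Hypothetical lookup tables; a real system would rely on much larger lexicons.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
ALIASES = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr", "presenter": "Tyler Carr"}

# Assumed input date formats; the normal form chosen here is ISO 8601.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")


def canonical_form(term):
    """Return the canonical form of a vocabulary item, or the item unchanged."""
    text = term.strip()
    key = text.lower()
    # Numbers, inflected forms, and alternative names are plain table lookups.
    for table in (NUMBER_WORDS, LEMMAS, ALIASES):
        if key in table:
            return table[key]
    # Dates are parsed with each known format until one succeeds.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text
```

For example, `canonical_form("Four")` yields `"4"` and `canonical_form("Mr. Carr")` yields `"Tyler Carr"`; unknown items pass through unchanged.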

Slide 13: Feature Extraction Tools
- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information
  - Part-of-speech information (subject, verb)
- Avoid analyzing too deep (for speed)
  - Does not use huge amounts of lexical information
  - No in-depth syntactic and semantic analysis

Slide 14: Feature Extraction Example
Disambiguating proper names (Nominator program)
- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The heuristics represent English naming conventions.
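To make the idea of string heuristics concrete, here is a minimal sketch that groups name mentions within one document. It is not the Nominator algorithm; the pattern, the title list, and the rule "a single capitalized word is kept only if it is a token of exactly one multi-word name in the same document" are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed English naming convention: optional title, then capitalized words.
NAME_PATTERN = re.compile(r"(?:(?:Mr|Mrs|Ms|Dr)\. )?[A-Z][a-z]+(?: [A-Z][a-z]+)*")
TITLE_PREFIX = re.compile(r"^(?:Mr|Mrs|Ms|Dr)\. ")


def group_name_mentions(document):
    """Map each multi-word name to the mentions in the document that refer to it."""
    mentions = [m.group(0) for m in NAME_PATTERN.finditer(document)]
    stripped = [TITLE_PREFIX.sub("", m) for m in mentions]
    # Multi-word names act as the canonical anchors for this document.
    anchors = {name for name in stripped if " " in name}
    groups = defaultdict(list)
    for mention, name in zip(mentions, stripped):
        if name in anchors:
            groups[name].append(mention)
            continue
        # A single word is resolved only if exactly one anchor contains it;
        # other capitalized words (e.g. sentence starts) are discarded.
        candidates = [a for a in anchors if name in a.split()]
        if len(candidates) == 1:
            groups[candidates[0]].append(mention)
    return dict(groups)
```

Because the document is the unit of context, "Carr" is linked to "Tyler Carr" only when both occur in the same text; no semantic interpretation is involved.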

Slide 15: Feature Extraction Goals
- Very fast processing to deal with huge amounts of data
- Domain independence for general applicability

Slide 16: Outline (current section: Clustering and Categorizing)

Slide 17: Clustering
- Also called Knowledge Discovery
- Fully automatic process
- Partitions a given collection into groups of documents similar in content
- Clusters are identifiable by feature vectors
- Provides a set of keywords for each cluster
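The points above can be illustrated with a small sketch that partitions documents by feature-vector similarity and reports keywords per cluster. The technique shown is a simple single-pass "leader" clustering over bag-of-words vectors with cosine similarity; it stands in for the clustering engines discussed in the talk and is not their actual algorithm. All function names and the threshold value are assumptions.

```python
import math
from collections import Counter


def feature_vector(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_cluster(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = feature_vector(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            # The centroid is kept as a sum; cosine similarity ignores scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def keywords(cluster, k=3):
    """The k most frequent terms of a cluster serve as its keywords."""
    return [term for term, _ in cluster["centroid"].most_common(k)]
```

The result depends on document order and on the threshold, which is one reason a production tool would use a more robust method.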

Slide 18: Two Clustering Engines
- Hierarchical Clustering tool
  - Orders the clusters into a tree reflecting various levels of similarity
- Binary Relational Clustering tool
  - Produces a flat clustering together with relationships of different strength between the clusters
  - Relationships reflect inter-cluster similarities
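As a rough illustration of how a hierarchical engine can order clusters into a tree, the sketch below performs naive agglomerative clustering: it repeatedly merges the two most similar clusters until one tree remains. Jaccard similarity on token sets and the use of the token-set union as the merged cluster's profile are assumptions made for brevity, not the method of the Hierarchical Clustering tool.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_tree(docs):
    """Return a nested tuple of document indices; requires at least one document."""
    # Each node is (subtree, token_set); leaves are plain document indices.
    nodes = [(i, set(doc.lower().split())) for i, doc in enumerate(docs)]
    while len(nodes) > 1:
        # Pick the pair of current clusters with the highest similarity.
        pairs = ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes)))
        i, j = max(pairs, key=lambda p: jaccard(nodes[p[0]][1], nodes[p[1]][1]))
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] | nodes[j][1])
        # Replace the two merged clusters with their parent node.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

Pairs merged early sit deep in the tree and are the most similar; the root joins the least similar groups, which is the "various levels of similarity" idea from the slide.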

Slide 19: Clustering Model (figure)

Slide 20: Categorization
- Also called Information Distillation
- Topic Categorization Tool
  - Assigns documents to pre-existing categories (“topics” or “themes”)
  - Categories are chosen to match the intended use of the collection

Slide 21: Categorization
- Categories are defined by providing a set of sample documents for each category
- The training phase produces a special index, called the categorization schema
- The categorization tool returns a set of category names and confidence levels for each document

Slide 22: Categorization
- If the confidence is below some threshold, the document is set aside for a human categorizer
- Tests have shown that the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another
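The categorization workflow described on the preceding slides (sample documents per category, a trained schema, ranked categories with confidence levels, and a threshold for human review) can be sketched as follows. This is a nearest-centroid classifier used as a stand-in; the Topic Categorization Tool's actual method is not described in the slides. Cosine similarity is used as the confidence level, and `train_schema`, `categorize`, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_schema(samples):
    """Build a 'schema' (category -> summed term counts) from sample documents."""
    schema = {}
    for category, docs in samples.items():
        centroid = Counter()
        for doc in docs:
            centroid.update(vectorize(doc))
        schema[category] = centroid
    return schema


def categorize(schema, document, threshold=0.4):
    """Return ranked (category, confidence) pairs and a flag for human review."""
    vec = vectorize(document)
    scored = sorted(((cosine(vec, c), name) for name, c in schema.items()), reverse=True)
    ranked = [(name, score) for score, name in scored]
    # Documents whose best confidence is below the threshold go to a human.
    needs_human = not ranked or ranked[0][1] < threshold
    return ranked, needs_human
```

A message sharing no terms with any category gets confidence 0.0 everywhere and is therefore routed to a human categorizer.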

Slide 23: Categorization Model (figure)

Slide 24: Outline (current section: Applications)

Slide 25: IBM Intelligent Miner for Text
- Software development kit (not a full application)
- Contains the necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - Drop-in intranet search solutions

Slide 26: Applications
- A Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
- “Help companies better understand what their customers want and what they think about the company itself.”

Slide 27: Customer Intelligence Process
1. Take the body of communications with customers as input.
2. Cluster the documents to identify issues.
3. Characterize the clusters to identify the conditions for problems.
4. Assign new messages to the appropriate clusters.

Slide 28: Customer Intelligence Usage
- Knowledge Discovery
  - Clustering is used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of clustering results
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters

Slide 29: Outline (current section: Exam Questions)

Slide 30: Exam Question #1
Name an example of each of the two main classes of applications of text mining.
- Knowledge Discovery: discovering a common customer complaint among a large volume of feedback
- Information Distillation: filtering future comments into pre-defined categories

Slide 31: Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
- It adds a feature extraction function
- It is not feasible to have humans select features
- It produces highly dimensional, sparsely populated feature vectors
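The last point, highly dimensional and sparsely populated feature vectors, can be seen even on a toy collection. In the sketch below each document is stored as a sparse mapping from dimension index to count, so only non-zero entries take space. The helper names `sparse_vectors` and `density` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def sparse_vectors(docs):
    """Return (vocabulary, vectors) where each vector maps dimension index -> count."""
    # One dimension per distinct term in the whole collection.
    vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc.lower().split():
            vec[index[term]] = vec.get(index[term], 0) + 1
        vectors.append(vec)
    return vocabulary, vectors


def density(vec, dimensions):
    """Fraction of dimensions that are non-zero in a sparse vector."""
    return len(vec) / dimensions if dimensions else 0.0
```

With realistic collections the vocabulary grows to many thousands of terms while each document uses only a small number of them, so density drops far below what this small example shows.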

Slide 32: Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
- It does not perform in-depth syntactic or semantic analysis of texts

Slide 33: Thank You. Any Questions?
