Authors: Jochen Doerre, Peter Gerstl, Roland Seiffert Adapted from slides by: Trevor Crum Presenter: Caitlin Baker Text Mining: Finding Nuggets in Mountains.

Authors: Jochen Doerre, Peter Gerstl, Roland Seiffert Adapted from slides by: Trevor Crum Presenter: Caitlin Baker Text Mining: Finding Nuggets in Mountains of Textual Data 1

Outline ●Definition and Paper Overview ●Motivation ●Methodology ●Software Packages ●Feature Extraction ●Clustering and Categorizing ●Some Applications ●Comparison with Data Mining ●Conclusion & Exam Questions 2

Definition ●Text Mining: ○ The discovery by computer of new, previously unknown information, by automatically extracting information from different unstructured textual documents. ○ Also referred to as text data mining; roughly equivalent to text analytics, which refers more specifically to problems in business settings. 3

Paper Overview ●This paper introduced text mining and how it differs from data mining proper. ●Focused on the tasks of feature extraction and clustering/categorization ●Presented an overview of the tools/methods of IBM’s Intelligent Miner for Text 4

Motivation ●A large portion of a company’s data is unstructured or semi-structured – about 90% in 1999! ○ Letters ○ Emails ○ Phone transcripts ○ Contracts ○ Technical documents ○ Patents ○ Web pages ○ Articles 6

Unstructured Data 7
Chapter   Date      Problem
32-31-01  1/1/1999  Water dripping on right hand lg. tom 9275-412
32-31-01  2/3/1999  Phil, rough landing lg seems to have a crack
32-31-01  4/1/1999  Saw leaking in the rh landing g. apr 1999

Text Mining Benefits ●Ability to quickly process large amounts of textual data ●“Objectivity” and customizability of the process ●Possibility to automate labor-intensive routine tasks 8

Typical Applications ●Summarizing documents ●Discovering/monitoring relations among people, places, organizations, etc. ●Customer profile analysis ●Trend analysis ●Spam identification ●Public health early warning ●Event tracks ●Predictive analytics 9

Methodology: Challenges ●Information is in unstructured textual form ●Natural language interpretation is a difficult and complex task (not fully possible) ○ Google and Watson are a step closer ●Text mining deals with huge collections of documents ○ Impossible for humans to examine manually 11

Google vs Watson ●Google justifies the answer by returning the text documents where it found the evidence. ●Google finds the documents that best match a given keyword. ●Watson tries to understand the semantics behind a given key phrase or question. ●Then Watson uses its huge knowledge base to find the correct answer. 12

Methodology: Two Aspects ●Knowledge Discovery ○ Feature Extraction ○ Mining proper – determining some structure ●Information Distillation ○ Analysis of feature distribution ○ Mining on the basis of some pre-established structure 13

Two Text Mining Approaches ●Extraction ○ Extraction of codified information from single documents ●Analysis ○ Analysis of the features to detect patterns, trends, and other similarities over whole collections of documents 14

IBM Intelligent Miner for Text ●IBM introduced Intelligent Miner for Text in 1998 ●SDK with: Feature extraction, clustering, categorization, and more ●Traditional components (search engine, etc) 16

IBM SPSS Text Analytics ●Clustering/categorization ●Extraction of words with ranking ●Produces graphical output 17

Advantages to IBM’s approach ●Processing is very fast (helps when dealing with huge amounts of data) ●Heuristics work reasonably well ●Generally applicable to any domain 18

SAS Text Miner ●Term profiling and trending ●Document theme discovery ●Visual integration of results 19

Feature Extraction ●Recognize and classify “significant” vocabulary items from the text ●Categories of vocabulary 21

Extracted Information Classified into Categories ●Names of persons, organizations, and places ●Multiword terms ●Abbreviations ●Relations ●Other useful stuff: numerical or textual forms of numbers, percentages, dates, currency amounts, etc. 22

Canonical Form Examples ●Normalize numbers, money ○ Four = 4, five-hundred dollars = $500 ●Conversion of date to normal form ○ 8/17/1992 = August 17 1992 ●Morphological variants ○ Drive, drove, driven = drive ●Proper names and other forms ○ Mr. Johnson, Bob Johnson, The author = Bob Johnson 23
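
The canonical-form idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables NUMBER_WORDS and LEMMAS and both function names are hypothetical, it covers only number words, dates, and morphological variants, and it is not how Intelligent Miner for Text implements normalization.

```python
from datetime import datetime

# Hypothetical, tiny lookup tables; a real system would use much larger lexicons.
NUMBER_WORDS = {"four": "4", "five": "5"}
LEMMAS = {"drove": "drive", "driven": "drive", "drives": "drive"}


def canonical_date(text):
    """Convert a date written as M/D/YYYY into the form 'Month D YYYY'."""
    parsed = datetime.strptime(text, "%m/%d/%Y")
    return f"{parsed.strftime('%B')} {parsed.day} {parsed.year}"


def canonical_token(token):
    """Map one word to its canonical form; unknown words are only lowercased."""
    lowered = token.lower()
    if lowered in NUMBER_WORDS:
        return NUMBER_WORDS[lowered]
    return LEMMAS.get(lowered, lowered)
```

Mapping every variant to one canonical string means that later counting and clustering steps treat "drove" and "driven" as the same feature.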

Feature Extraction Approach ●Linguistically motivated heuristics ●Pattern matching ●Limited lexical information (part-of-speech) ●Avoid analyzing with too much depth ○ Does not use too much lexical information ○ No in-depth syntactic or semantic analysis 24
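
A shallow, pattern-matching extractor in the spirit of this slide might look like the sketch below. The regular expressions and the name extract_features are assumptions made for illustration; the slides do not show the actual heuristics, and real patterns would be far more extensive.

```python
import re

# Illustrative patterns only: the categories come from the slides,
# the regular expressions themselves are assumptions.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?%"),
    "currency": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "name": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+\b"),
}


def extract_features(text):
    """Return (category, matched text) pairs found by surface patterns alone."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group()))
    return found
```

No parsing or semantic analysis is involved, which is the trade-off the slide describes: the extractor is fast and domain-independent, at the cost of missing anything the patterns do not anticipate.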

Feature Extraction Ex. 25
Chapter   Date      Problem
32-31-01  1/1/1999  Water dripping on right hand lg. tom 9275-412
32-31-01  2/3/1999  Phil, rough landing lg seems to have a crack
32-31-01  4/1/1999  Saw leaking in the rh landing g. apr 1999

Clustering ●Fully automatic process ●Documents are grouped according to similarity of their feature vectors ●Each cluster is labeled by a listing of the common terms/keywords ●Good for getting an overview of a document collection 27

Two Clustering Engines ●Hierarchical clustering ○ Orders the clusters into a tree reflecting various levels of similarity ●Binary relational clustering ○ Flat clustering ○ Relationships of different strengths between clusters, reflecting similarity 28
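
As a rough illustration of grouping documents by the similarity of their feature vectors, here is a minimal agglomerative (bottom-up hierarchical) sketch. The choices of word counts as features, cosine similarity, single-link merging, and the threshold value are all assumptions for the example; the slides do not specify which measures the IBM clustering engines use.

```python
import math
from collections import Counter


def feature_vector(text):
    """Bag-of-words counts stand in for extracted features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def single_link(c1, c2, vectors):
    """Similarity of two clusters = similarity of their closest members."""
    return max(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)


def cluster(texts, threshold=0.2):
    """Merge the most similar clusters until none is similar enough.

    Returns the final clusters (lists of document indices) and the merge
    history, which records the tree of similarity levels.
    """
    vectors = [feature_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    merges = []
    while len(clusters) > 1:
        best_score, (a, b) = max(
            (single_link(clusters[a], clusters[b], vectors), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < threshold:
            break
        merges.append((clusters[a], clusters[b], best_score))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


def common_terms(members, texts):
    """Label a cluster with the terms shared by all of its documents."""
    term_sets = [set(texts[i].lower().split()) for i in members]
    return sorted(set.intersection(*term_sets))
```

The merge history plays the role of the tree in hierarchical clustering, and common_terms mirrors the idea of labeling each cluster by its shared keywords.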

Clustering Model 29

Categorization ●Assigns documents to preexisting categories ●Classes of documents are defined by providing a set of sample documents. ●Training phase produces “categorization schema” ●Documents can be assigned to more than one category ●If confidence is low, document is set aside for human intervention 30
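
The categorization workflow described here (train on sample documents, allow several categories per document, set aside low-confidence cases) can be sketched as follows. Everything specific is an assumption for illustration: the "schema" is modeled as summed word counts per category, confidence as cosine similarity, and the function names and threshold are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts stand in for extracted features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(samples):
    """Build a schema: one summed term-count vector per category.

    `samples` maps each category name to a list of sample documents.
    """
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema


def categorize(text, schema, min_confidence=0.3):
    """Return every category whose score clears the threshold.

    Returns None when no category is confident enough, meaning the
    document should be set aside for human review.
    """
    vector = vectorize(text)
    scores = {c: cosine(vector, profile) for c, profile in schema.items()}
    assigned = sorted(c for c, s in scores.items() if s >= min_confidence)
    return assigned if assigned else None
```

Returning a list rather than a single label reflects that a document may belong to more than one category, and the None result models the hand-off to a human.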

Categorization Model 31

Applications ●Aircraft Faults using IBM SPSS Text Analytics ●Customer Relationship Management application provided by IBM Intelligent Miner for Text called “Customer Relationship Intelligence” or CRI ○ “Help companies better understand what their customers want and what they think about the company itself” 33

Aircraft Faults ●Take as input free-hand text from operators and aircraft mechanics ●Cluster the documents to identify faults ●Characterize the clusters to identify the conditions for faults ●Determine most common fault for a certain component 34

Customer Intelligence Process ●Take as input a body of communications with customers ●Cluster the documents to identify issues ●Characterize the clusters to identify the conditions for problems ●Assign new messages to appropriate clusters 35

Applications Summary ●Knowledge Discovery ○ Clustering used to create a structure that can be interpreted ●Information Distillation ○ Refinement and extension of clustering results ■ Interpreting the results ■ Tuning of the clustering process ■ Selecting meaningful clusters 36

Comparison with Data Mining ●Data mining ○ Discover hidden models. ○ Tries to generalize all of the data into a single model. ○ Marketing, medicine, health care ●Text mining ○ Discover hidden facts. ○ Tries to understand the details and cross-reference between individual instances. ○ Biosciences, customer profile analysis 38


Conclusion ●Text mining can be used as an effective business tool that supports ○ Creation of knowledge by preparing and organizing unstructured textual data [Knowledge Discovery] ○ Extraction of relevant information from large amounts of unstructured textual data through automatic pre-selection based on user-defined criteria [Information Distillation] 40

Exam Question #1 ●How does the procedure for text mining differ from the procedure for data mining? ○ Adds feature extraction phase ○ Infeasible for humans to select features manually ○ The feature vectors are, in general, highly dimensional and sparse 41
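
To make "highly dimensional and sparse" concrete, the sketch below builds a dense document-term matrix and measures how much of it is zero. It assumes plain whitespace-separated words as features, and the function name sparsity is hypothetical.

```python
def sparsity(texts):
    """Return (number of dimensions, fraction of zero entries).

    Each distinct word in the collection is one dimension; each document
    is a row of word counts over that shared vocabulary.
    """
    vocab = sorted({word for text in texts for word in text.lower().split()})
    if not vocab:
        return 0, 0.0
    rows = [[text.lower().split().count(word) for word in vocab] for text in texts]
    zeros = sum(value == 0 for row in rows for value in row)
    return len(vocab), zeros / (len(vocab) * len(texts))
```

Every new document tends to add new words, so the number of dimensions grows with the collection while each document still uses only a handful of them.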

Exam Question #2 ●What is one application of text mining and why would that application be beneficial? ○ Customer Relationship Management application provided by IBM Intelligent Miner for Text called “Customer Relationship Intelligence” or CRI ○ “Help companies better understand what their customers want and what they think about the company itself” 42

Exam Question #3 ●What are three benefits of text mining? ○ 1. Efficiency ○ 2. Customizability ○ 3. Automation of tasks 43

Questions? 44
