CRM Segmentation: Segmentation of Textual Data. Zhangxi Lin.

2 Overview
– Text Mining Review
– Converting Unstructured Text to Structured Data
– Segmenting Textual Data
– Demonstrations

3 Text Mining Review

4 Text Mining – Why and How
– The volume of text data is much greater than that of numeric data.
– The means for dealing with text data are far from adequate.

5 What Text Mining Is
Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.
"SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole." (SAS Text Miner: Distilling Textual Data for Competitive Business Advantage)

6 What Text Mining Is Not
Text mining is not
– a text summarization tool
– an information extraction methodology
– a natural language processor.

7 Two Types of Document Data
– Document text field
– Separate document files (TMFILTER)

8 The SAS Text Mining Process
1. Preprocess document files to create a SAS data set.
– TMFILTER macro
– SAS language features
2. Parse the document field.
– PARSE tab in Text Miner
– Stemming
– Part-of-speech tagging
– Entities
– Stop/start lists
– Synonym lists
– And so forth
continued...

9 The SAS Text Mining Process
3. Derive the term-by-document frequency matrix.
– The Text Miner Transform tab
– Frequency weights
– Term weights
4. Transform the term-by-document frequency matrix.
– The Text Miner Transform tab
– Singular value decomposition (SVD)
– Roll up terms
5. Perform the analysis.
– Exploration
– Clustering/unsupervised learning
– Predictive modeling

10 Text Mining Strengths
– Clustering documents in a corpus
– Investigating word (token) distribution across documents within a corpus
– Identifying words with the highest discriminatory power
– Classifying documents into predefined categories
– Integrating text data with structured data to enrich predictive modeling endeavors

11 Text Mining Deficiencies
– Text mining algorithms perform poorly in distinguishing negations, for example:
  Herman was involved in a motor vehicle accident.
  Herman was NOT involved in a motor vehicle accident.
– Text mining cannot generally make value judgments, for example, classifying an article as positive or negative with respect to any tokens it contains.
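The negation problem is easy to see in a bag-of-words view of the two example sentences. The following is an illustrative Python sketch (not SAS code): the two term vectors differ in only a single token, so a frequency-based representation treats the sentences as nearly identical.

```python
# Bag-of-words view of the two example sentences from the slide.
# Illustrative sketch; SAS Text Miner's internal representation differs.
from collections import Counter

s1 = "Herman was involved in a motor vehicle accident"
s2 = "Herman was NOT involved in a motor vehicle accident"

bag1 = Counter(s1.lower().split())
bag2 = Counter(s2.lower().split())

# Fraction of token mass the two "documents" share: they differ only
# in the single token "not", so the vectors are almost identical.
overlap = sum((bag1 & bag2).values()) / sum((bag1 | bag2).values())
print(round(overlap, 3))  # 0.889
```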

12 Text Mining Deficiencies
– Text mining algorithms do not work well with large documents.
– Performance is slow.
– Increased term occurrence across documents decreases separation of documents.

13 The SAS Text Mining Process
1. Preprocess document files to create a SAS data set.
– TMFILTER macro
– SAS language features
2. Parse the document field.
– PARSE tab in Text Miner
– Stemming
– Part-of-speech tagging
– Entities
– Stop/start lists
– Synonym lists
– And so forth
continued...

14 The SAS Text Mining Process
3. Derive the term-by-document frequency matrix.
– The Text Miner Transform tab
– Frequency weights
– Term weights
4. Transform the term-by-document frequency matrix.
– The Text Miner Transform tab
– Singular value decomposition (SVD)
– Roll up terms
5. Perform the analysis.
– Exploration
– Clustering/unsupervised learning
– Predictive modeling

15 Using the TMFILTER Macro with the Newsgroups Data
This demonstration illustrates how to use the TMFILTER macro to process groups of text files.

16 SAS Text Miner Text Processing Features
– Text parsing
– Removal of stop words
– Part-of-speech tagging
– Stems and synonym handling
– Entities

17 Stop Words
Stop words are words that have little or no value in identifying a document or in comparing documents.
Standard stop lists contain stop words that are
– articles (the, a, this)
– conjunctions (and, but, or)
– prepositions (of, from, by).
Custom stop lists identify low-information words, like the word "computer" in a collection of articles about computers.

18 Sashelp.stoplst

19 Default Stop Lists
A default stop list or a user-defined stop list defines stop words to be removed. Default stop lists:
– English: sashelp.stoplst
– French: sashelp.frchstop
– German: sashelp.grmnstop
– Mixed: sashelp.mixdstop
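What applying a stop list does can be sketched in a few lines of Python. This is illustrative only: the stop set below is a tiny hand-picked subset, not the contents of sashelp.stoplst.

```python
# Minimal stop-word filter, analogous in spirit to applying a stop list.
# STOP_WORDS is a tiny illustrative subset, not a real stop list.
STOP_WORDS = {"the", "a", "this", "and", "but", "or", "of", "from", "by"}

def remove_stop_words(text: str) -> list:
    """Tokenize on whitespace and drop stop words (case-insensitive)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

tokens = remove_stop_words("The knowledge from the collection of documents")
print(tokens)  # ['knowledge', 'collection', 'documents']
```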

20 Stop List versus Start List
Use a start list when
– documents are dominated by "technical jargon"
– domain expertise can enhance text mining.
Use a stop list when
– documents are loosely related: news, business reports, Internet searches
– domain expertise is not available.

21 Issues in Creating a Start List
Do not just add high-frequency terms.
– Low-frequency terms that only appear in a few documents may be good discriminators.
– High-frequency terms may be candidates for a stop list.
Data-derived start lists should be reviewed by domain experts.

22 Tagging Parts of Speech
Part-of-speech tagging determines whether a word is a common noun, verb, adjective, proper noun, adverb, and so forth. It must disambiguate parts of speech when a word is used in different contexts:
– I wish that my bank had more ATM machines.
– You can bank on either Philadelphia or Oakland winning the Super Bowl next year.
– Settlers living on the west bank of the river were forced to relocate.

23 Tagging Parts of Speech in Text Miner continued...

24 Tagging Parts of Speech in Text Miner

25 Stemming
– May employ an algorithm and/or table look-up
  – Porter stemmer
  – Lovins stemmer
– Errors of commission (organization → organ)
– Errors of omission (matrices → matrix)
– Can be related to spell checking
continued...

26 Stemming Examples
– BIG: BIG, BIGGER, BIGGEST
– REACH: REACH, REACHES, REACHED, REACHING
– WORK: WORK, WORKS, WORKED, WORKING
– CHILD: CHILD, CHILDREN
– KNIFE: KNIFE, KNIVES
– PERRO: PERRO, PERRA (Spanish, male and female dog)
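The examples above can be reproduced with a toy stemmer that combines table look-up for irregular forms with naive suffix stripping. This is a simplified Python sketch of the idea, not the Porter or Lovins algorithm and not what Text Miner actually runs.

```python
# Toy stemmer: table look-up for irregular forms plus naive suffix
# stripping. Simplified illustration only.
IRREGULAR = {"children": "child", "knives": "knife"}  # table look-up

def stem(word: str) -> str:
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in ("ing", "est", "er", "ed", "es", "s"):
        # Require a stem of at least 3 letters before stripping.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print(stem("REACHING"), stem("WORKED"), stem("CHILDREN"), stem("KNIVES"))
# reach work child knife
```

Naive rules like these also produce the errors of commission noted earlier (e.g., this sketch strips "bigger" to "bigg").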

27 Stemming in Text Miner continued...

28 Stemming in Text Miner Text Miner performs stemming to derive stem synonyms, for example, run/ran/runs/running, and combines these with defined synonyms, for example, run/sprint. The default synonym data set for Text Miner, sashelp.engsynms, is primarily for illustration. Synonyms may split based on part of speech, for example, teach/train=verb, locomotive/train=noun.

29 Synonyms
– Language dictionaries
– Technical jargon
– Abbreviations
– Specialty dictionaries
Note: This could be associated with stemming in file preprocessing.
continued...

30 Synonyms
instruct, educate, teach → train

31 Synonym Lists
– Default: sashelp.engsynms (Tools → Settings)
– User-defined: a SAS data set with three fields: TERM ($25.), PARENT ($25.), CATEGORY ($12.)
– Example: TERM=EM, PARENT=Enterprise Miner, CATEGORY=PRODUCT
continued...
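In the same spirit, synonym handling can be sketched as a TERM → PARENT mapping applied after tokenization. This is an illustrative Python sketch; the field lengths and the CATEGORY field described above are omitted.

```python
# TERM -> PARENT synonym mapping, mirroring the synonym-list idea
# (CATEGORY handling omitted in this sketch).
SYNONYMS = {
    "em": "enterprise miner",
    "instruct": "train",
    "educate": "train",
    "teach": "train",
}

def apply_synonyms(tokens):
    """Replace each term by its parent term if one is defined."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

result = apply_synonyms(["EM", "can", "educate"])
print(result)  # ['enterprise miner', 'can', 'train']
```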

32 Converting Unstructured Text to Structured Data

33 Term-Document Frequency Matrices

              Documents
Term   ID   D1     D2     ...   Dn
T1     1    d1,1   d1,2   ...   d1,n
T2     2    d2,1   d2,2   ...   d2,n
...

di,j = count of the number of times word i occurs in document j
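Building such a matrix is straightforward. A plain-Python sketch with two tiny documents (terms as rows, documents as columns):

```python
# Build a small term-by-document frequency matrix d[i][j]:
# counts of term i in document j.
from collections import Counter

docs = [
    "text mining converts text to data",
    "data mining finds patterns in data",
]
counts = [Counter(d.split()) for d in docs]
terms = sorted(set().union(*counts))              # row labels T1..Tm
matrix = [[c[t] for c in counts] for t in terms]  # m terms x n documents

for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```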

34 Term-Document Frequency Matrices
Pitfalls:
– Sparse cells (many zeroes)
– Weak discriminatory power
– Too large
Solutions:
– Term frequency functions
– Singular value decomposition
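The sparsity pitfall is usually also addressed in storage: keep only the nonzero cells. A sketch of a dictionary-of-dictionaries layout (illustrative Python, not how any particular product stores the matrix):

```python
# Sparse storage for a mostly-zero term-document matrix:
# {term: {doc_index: count}}, keeping only nonzero counts.
from collections import defaultdict

docs = ["text mining", "data mining finds data"]
sparse = defaultdict(dict)
for j, doc in enumerate(docs):
    for token in doc.split():
        sparse[token][j] = sparse[token].get(j, 0) + 1

print(dict(sparse["data"]))      # {1: 2} -- zeros are simply absent
print(sparse["text"].get(1, 0))  # 0
```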

35 Weighted Term-Document Frequency Matrix

                    Documents
Word phrase   ID   D1     D2     ...   Dn
T1            1    w1,1   w1,2   ...   w1,n
T2            2    w2,1   w2,2   ...   w2,n
...

wi,j = weight associated with word i in document j

36 Frequency and Term Weights Notation

37 Deriving the Weighted Term-Document Frequency Matrix

38 Transformed Term-Document Frequency Matrix Elements The original frequencies in the Term-Document Frequency Matrix are transformed to the “expected” frequencies

39 Default Weights
– Term weight = Entropy
– Frequency weight = Log

40 Frequency Weights
– Log
– Binary
– None
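One common formulation of the default weights named above multiplies a log frequency weight by an entropy term weight. The formulas below are a standard textbook version and a sketch only; the exact formulas in any given product may differ in detail.

```python
# Sketch of log frequency weight times entropy term weight:
#   frequency weight: l(f) = log2(f + 1)
#   term weight:      g_i  = 1 + sum_j (p_ij * log2(p_ij)) / log2(n)
# where p_ij = f_ij / (global frequency of term i), n = number of docs.
import math

def log_freq(f: int) -> float:
    return math.log2(f + 1)

def entropy_weight(row) -> float:
    n = len(row)
    total = sum(row)
    h = 0.0
    for f in row:
        if f > 0:
            p = f / total
            h += p * math.log2(p)
    return 1.0 + h / math.log2(n)

row = [2, 0, 2, 0]        # term frequencies across 4 documents
tw = entropy_weight(row)  # lower: the term is spread across documents
cell = log_freq(row[0]) * tw
print(round(tw, 3), round(cell, 3))
print(entropy_weight([4, 0, 0, 0]))  # 1.0: concentrated terms get max weight
```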

41 Singular Value Decomposition
– Classical SVD in statistics: A = U Σ V′
– For a term-document frequency matrix A, U is the matrix of term vectors, Σ is a diagonal matrix with the singular values along the diagonal, and V is the matrix of document vectors.
– The projection Σ V* is output as a set of SVD dimensions for each document, with the dimensions stored in the variables COL1, COL2, and so forth. V* is a sub-matrix of V determined by the maximum dimension specified by the user.
– Resolution (low/medium/high) changes the cutoff value for selecting a "significant" number of dimensions.
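To make the decomposition concrete, here is a dependency-free Python sketch that extracts the leading SVD dimension of a tiny term-document matrix by power iteration on AᵀA. In practice one would call a library routine such as numpy.linalg.svd; this hand-rolled version is for illustration only.

```python
# Extract the leading SVD dimension of a small term-document matrix A
# via power iteration on A^T A (no external libraries). The leading
# right-singular vector v1, scaled by sigma1, gives each document's
# coordinate on the first "concept" axis.
import math

A = [  # 4 terms x 3 documents
    [2, 0, 1],
    [1, 0, 1],
    [0, 2, 0],
    [0, 1, 0],
]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

At = transpose(A)
v = [1.0] * len(A[0])
for _ in range(100):                      # power iteration on A^T A
    w = matvec(At, matvec(A, v))
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

sigma1 = math.sqrt(sum(x * x for x in matvec(A, v)))  # leading singular value
doc_coords = [sigma1 * x for x in v]      # first SVD dimension per document
print([round(c, 3) for c in doc_coords])
```

Documents 1 and 3 (which share terms) land near each other on this axis, while document 2 scores near zero, which is exactly the "concept" separation SVD is used for.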

42 SVD is very useful for
– Compression
– Noise reduction
– Finding "concepts" or "topics" (text mining/LSI)
– Data exploration and visualizing data (e.g., spatial data/PCA)
– Classification (e.g., of handwritten digits)

43 SVD appears under different names
– Principal Component Analysis (PCA)
– Latent Semantic Indexing (LSI) / Latent Semantic Analysis (LSA)
– Karhunen-Loève expansion / Hotelling transform (in image processing)

Segmenting Textual Data

45 The Text Mining Project
Document analysis is the goal of the project:
– Exploratory analysis of document collections
– Clustering of documents as an aid to human evaluation of documents

46 Text Mining as Part of a Data Mining Project
– Predictive modeling with many fields, one or more of which are unstructured text
– Recommender systems
– Others

47 Precision vs. Recall
Measures of how effectively a binary text classifier predicts which documents are relevant to a particular category.
– Precision – the percentage of predicted positives that are actually positive
  Precision = TP / (TP + FP)
– Recall – how well the classifier finds the relevant documents and properly assigns them to their correct category, that is, the percentage of all positive instances that are predicted positive
  Recall = TP / (TP + FN)
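With the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), the two measures are a one-liner each. A quick illustrative sketch with made-up counts:

```python
# Compute precision and recall from confusion-matrix counts for a
# binary text classifier's predictions on one category.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)  # of documents predicted relevant, how many are
    recall = tp / (tp + fn)     # of truly relevant documents, how many found
    return precision, recall

# Hypothetical counts: 40 relevant docs found, 10 false alarms, 20 missed.
p, r = precision_recall(tp=40, fp=10, fn=20)
print(p, round(r, 3))  # 0.8 0.667
```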

48 Text Mining as Part of a Data Mining Project
– The goals of the project influence how text mining is performed.
– A single unstructured text field becomes a set of K quantitative inputs.

49 Memory-Based Reasoning
Memory-based reasoning is a process that identifies similar cases and applies the information obtained from these cases to a new record.
In Enterprise Miner, the Memory-Based Reasoning node is a modeling tool that uses a k-nearest neighbor algorithm to categorize or predict observations.
The k-nearest neighbor algorithm takes a data set and a probe, where each observation in the data set is composed of a set of variables and the probe has one value for each variable. The distance between each observation and the probe is calculated. The k observations that have the smallest distances to the probe are the k nearest neighbors of that probe.
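The k-nearest neighbor idea described above can be sketched in a few lines. This is an illustrative Python sketch of the algorithm, not the Enterprise Miner node itself:

```python
# Minimal k-nearest neighbor classifier: score a probe by the majority
# class among the k observations closest to it.
import math
from collections import Counter

def knn_classify(data, probe, k=3):
    """data: list of (vector, label) pairs; probe: vector of same length."""
    nearest = sorted(data, key=lambda obs: math.dist(obs[0], probe))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"),
]
result = knn_classify(data, probe=(0.15, 0.15), k=3)
print(result)  # A
```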