Text Mining IS698 Min Song


Example 1: KM People Finder
- The needs:
  - Find people as well as documents that can address my information need.
  - Promote collaboration and knowledge sharing.
  - Leverage the existing information access system.
- The information sources:
  - …, groupware, online reports, …

Example 1: Simple KM People Finder
[Diagram: Query → Search or Navigation System → Relevant Docs → Name Extractor (using an Authority List) → Ranked People Names]
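The pipeline in the diagram can be sketched in a few lines of Python. This is only an illustration: the documents, the names on the authority list, and the ranking rule (number of relevant documents that mention a name) are all assumptions, since the slide does not say how names are extracted or ranked.

```python
from collections import Counter

def rank_people(relevant_docs, authority_list):
    """Count how many relevant documents mention each name on the authority list."""
    counts = Counter()
    for doc in relevant_docs:
        text = doc.lower()
        for name in authority_list:
            if name.lower() in text:
                counts[name] += 1
    # Sort by mention count (descending), then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical documents returned by the search or navigation system.
docs = [
    "Alice Lee wrote the clustering report with Bob Kim.",
    "Clustering notes by Alice Lee.",
    "Bob Kim maintains the groupware server.",
    "Alice Lee reviewed the online report.",
]
# Hypothetical authority list of known people.
authority = ["Alice Lee", "Bob Kim", "Carol Diaz"]

ranked_people = rank_people(docs, authority)
print(ranked_people)
```

Names that never appear in a relevant document are simply left out of the ranking.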

Example 1: KM People Finder

Text Mining Definition
- Many definitions in the literature:
  - “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”
  - An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.

Text Mining Definition
- What is “previously unknown” information?
  - Strict definition: information that not even the writer knows.
    - e.g., discovering a new method for hair growth that is described as a side effect of a different procedure
  - Lenient definition: rediscovering the information that the author encoded in the text.
    - e.g., automatically extracting a product’s name from a web page

Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Text Mining Applications
- Marketing: discover distinct groups of potential buyers according to a text-based user profile
  - e.g., Amazon
- Industry: identify groups of competitors' web pages
  - e.g., competing products and their prices
- Job seeking: identify parameters in searching for jobs
  - e.g., …

Text Mining Methods
- Information Retrieval: indexing and retrieval of textual documents
- Information Extraction: extraction of partial knowledge in the text
- Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web
- Clustering: generating collections of similar text documents

Information Retrieval
- Given:
  - A source of textual documents
  - A user query (text based)
- Find:
  - A (ranked) set of documents that are relevant to the query
[Diagram: Query (e.g., Spam / Text) and Documents source → IR System → Ranked Documents]
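As a rough sketch of what "a ranked set of documents" can mean, the example below scores each document against the query with a smoothed TF-IDF weight. The slide does not name a ranking function, so TF-IDF, the whitespace tokenizer, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def rank_documents(query, documents):
    """Return (score, doc_index) pairs sorted by descending TF-IDF score."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    scores = []
    for idx, counts in enumerate(doc_counts):
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in doc_counts if term in c)   # document frequency
            if df == 0:
                continue
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
            score += counts[term] * idf                    # term frequency * IDF
        scores.append((score, idx))
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical documents source.
documents = [
    "text mining finds knowledge in text",
    "data mining with decision trees",
    "cooking pasta at home",
]
ranking = rank_documents("text mining", documents)
for score, idx in ranking:
    print(f"{score:.3f}  {documents[idx]}")
```

Documents that share no term with the query receive a score of zero and fall to the bottom of the ranking.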

Intelligent Information Retrieval
- Meaning of words
  - Synonyms: “buy” / “purchase”
  - Ambiguity: “bat” (baseball vs. mammal)
- Order of words in the query
  - “hot dog stand in the amusement park”
  - “hot amusement stand in the dog park”
- User dependency for the data
  - Direct feedback
  - Indirect feedback
- Authority of the source
  - IBM is more likely to be an authoritative source than my distant cousin

What is Information Extraction?
- Given:
  - A source of textual documents
  - A well-defined, limited query (text based)
- Find:
  - Sentences with relevant information
  - Extract the relevant information and ignore non-relevant information (important!)
  - Link related information and output it in a predetermined format

Information Extraction: Example
- Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
- Incident Date: 19 Apr 89
- Incident Type: Bombing
- Perpetrator Individual ID: “urban guerillas”
- Human Target Name: “Roberto Garcia Alvarado”
- ...

What is Information Extraction?
[Diagram: Documents source, Query 1 (e.g., job title), and Query 2 (e.g., salary) → Extraction System → Combine Query Results → Ranked Documents with Relevant Info 1, Relevant Info 2, Relevant Info 3]

Why Mine the Web?
- Enormous wealth of textual information on the Web
  - Book/CD/video stores (e.g., Amazon)
  - Restaurant information (e.g., Zagats)
  - Car prices (e.g., Carpoint)
- Lots of data on user access patterns
  - Web logs contain the sequence of URLs accessed by users
- Possible to retrieve “previously unknown” information
  - People who ski also frequently break their leg.
  - Restaurants that serve seafood in California are likely to be outside San Francisco.

Mining the Web
[Diagram: Web Spider → Documents source; Query and Documents source → IR / IE System → Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3)]

Unique Features of the Web
- The Web is a huge collection of documents, many of which contain:
  - Hyperlink information
  - Access and usage information
- The Web is very dynamic
  - Web pages are constantly being generated (and removed)
- Challenge: develop new Web mining algorithms to...
  - Exploit hyperlinks and access patterns.
  - Be adaptable to its documents source.

Intelligent Web Search
- Combine the intelligent IR tools
  - Meaning of words
  - Order of words in the query
  - User dependency for the data
  - Authority of the source
- With the unique web features
  - Retrieve hyperlink information
  - Utilize hyperlinks as input

What is Clustering?
- Given:
  - A source of textual documents
  - A similarity measure
    - e.g., how many words are common in these documents
- Find:
  - Several clusters of documents that are relevant to each other
[Diagram: Documents source and Similarity measure → Clustering System → clusters of Docs]

Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Text characteristics: Outline
- Large textual database
- High dimensionality
- Several input modes
- Dependency
- Ambiguity
- Noisy data
- Not well-structured text

Text characteristics
- Large textual database
  - Efficiency consideration
    - over 2,000,000,000 web pages
    - almost all publications are also in electronic form
- High dimensionality (sparse input)
  - Consider each word/phrase as a dimension
- Several input modes
  - e.g., Web mining: information about the user is generated by semantics, browse pattern, and an outside knowledge base

Text characteristics
- Dependency
  - Relevant information is a complex conjunction of words/phrases
    - e.g., document categorization, pronoun disambiguation
- Ambiguity
  - Word ambiguity
    - Pronouns (he, she, …)
    - “buy”, “purchase”
  - Semantic ambiguity
    - The king saw the rabbit with his glasses.

Text characteristics
- Noisy data
  - Example: spelling mistakes
- Not well-structured text
  - Chat rooms
    - “r u available ?”
    - “Hey whazzzzzz up”
  - Speech

Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Text mining process

- Text preprocessing
  - Syntactic/semantic text analysis
- Features generation
  - Bag of words
- Features selection
  - Simple counting
  - Statistics
- Text/data mining
  - Classification: supervised learning
  - Clustering: unsupervised learning
- Analyzing results
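The features generation and features selection steps can be illustrated with a small bag-of-words sketch. The sample documents, the punctuation stripping, and the selection rule (keep words that occur in at least two documents) are assumptions; the slide only names "bag of words" and "simple counting".

```python
from collections import Counter

def bag_of_words(text):
    """Features generation: map a document to word counts, ignoring word order."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return Counter(w for w in words if w)

def select_features(bags, min_docs=2):
    """Features selection by simple counting: keep words found in at least min_docs documents."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)

# Hypothetical documents.
docs = [
    "The football fan is a hooligan.",
    "The Italian football fan sings.",
    "Clustering groups similar documents.",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = select_features(bags)
# Each document becomes a vector with one dimension per selected word.
vectors = [[bag[w] for w in vocabulary] for bag in bags]
print(vocabulary)
print(vectors)
```

The resulting vectors are what the later classification and clustering steps operate on.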

Syntactic / Semantic text analysis
- Part-of-speech (POS) tagging
  - Find the corresponding POS for each word
    - e.g., John (noun) gave (verb) the (det) ball (noun)
  - ~98% accurate
- Word sense disambiguation
  - Context based or proximity based
  - Very accurate
- Parsing
  - Generates a parse tree (graph) for each sentence
  - Each sentence is a stand-alone graph

Text Mining: Classification definition
- Given: a collection of labeled records (training set)
  - Each record contains a set of features (attributes) and the true class (label)
- Find: a model for the class as a function of the values of the features
- Goal: previously unseen records should be assigned a class as accurately as possible
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
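A minimal sketch of the train/test workflow follows. The word-voting model is a toy stand-in invented for this example, not a method from the talk, and the labeled records are hypothetical. The point is only the division of roles: the training set builds the model and the held-out test set measures its accuracy.

```python
from collections import Counter, defaultdict

def train(records):
    """Build a word -> label-count table from labeled (text, label) records."""
    votes = defaultdict(Counter)
    labels = Counter()
    for text, label in records:
        labels[label] += 1
        for word in set(text.lower().split()):
            votes[word][label] += 1
    default_label = labels.most_common(1)[0][0]  # fallback for unseen words
    return votes, default_label

def classify(model, text):
    """Let every known word vote for the labels it was seen with."""
    votes, default_label = model
    tally = Counter()
    for word in set(text.lower().split()):
        tally.update(votes.get(word, {}))
    return tally.most_common(1)[0][0] if tally else default_label

# Hypothetical labeled records.
records = [
    ("goal scored in the football match", "sports"),
    ("the team won the football cup", "sports"),
    ("police report a robbery downtown", "crime"),
    ("robbery suspect arrested by police", "crime"),
    ("football fans cheer the team", "sports"),
    ("police investigate the robbery", "crime"),
]
train_set, test_set = records[:4], records[4:]   # simple holdout split

model = train(train_set)
correct = sum(classify(model, text) == label for text, label in test_set)
holdout_accuracy = correct / len(test_set)
print(holdout_accuracy)
```

In practice the split is usually randomized; a fixed split is used here so the run is reproducible.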

Text Mining: Clustering definition
- Given: a set of documents and a similarity measure among documents
- Find: clusters such that:
  - Documents in one cluster are more similar to one another
  - Documents in separate clusters are less similar to one another
- Goal: finding a correct set of documents
- Similarity measures:
  - Euclidean distance if attributes are continuous
  - Other problem-specific measures
    - e.g., how many words are common in these documents
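The two measures named on this slide can be written down directly. The function names and sample inputs are illustrative assumptions.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def common_words(doc_a, doc_b):
    """Problem-specific similarity: number of distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
overlap = common_words("the English football fan", "the Italian football fan")
print(distance, overlap)
```

Note the opposite directions: a small Euclidean distance means similar documents, whereas a large word overlap means similar documents.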

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Evaluation: What Is Good Classification?
- Correct classification: the known label of a test sample is identical to the class result from the classification model
- Accuracy ratio: the percentage of test set samples that are correctly classified by the model
- A distance measure between classes can be used
  - e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
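The difference between the plain accuracy ratio and a distance-aware evaluation can be sketched as follows. The class distances (0.2 between the two sports, 1.0 between a sport and crime) and the two sets of predictions are made-up values for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical distances between classes: related topics are closer together.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}

def mean_error_distance(true_labels, predicted_labels):
    """Average class distance of the predictions; a correct prediction costs 0."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += CLASS_DISTANCE[frozenset([t, p])]
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]  # near miss
model_b = ["crime", "football", "crime", "basketball"]       # far miss

print(accuracy(truth, model_a), accuracy(truth, model_b))
print(mean_error_distance(truth, model_a), mean_error_distance(truth, model_b))
```

Both models make exactly one mistake, so their accuracy is identical; only the distance-aware measure separates the near miss from the far miss.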

Evaluation: What Is Good Clustering?
- A good clustering method produces high-quality clusters with...
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
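One way to put numbers on "high intra-class, low inter-class similarity" is to average a pairwise similarity within and across clusters. The sketch below assumes Jaccard similarity over word sets and uses hypothetical mini-documents; it also assumes at least two clusters and at least one cluster with two or more documents, so that both averages are defined.

```python
from itertools import combinations

def jaccard(doc_a, doc_b):
    """Shared words divided by all distinct words in the two documents."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

def cluster_quality(clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra, inter = [], []
    for cluster in clusters:
        intra += [jaccard(x, y) for x, y in combinations(cluster, 2)]
    for c1, c2 in combinations(clusters, 2):
        inter += [jaccard(x, y) for x in c1 for y in c2]
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Two hypothetical clusterings of the same four documents.
good = [["football fan", "football match"], ["bank robbery", "bank fraud"]]
bad = [["football fan", "bank robbery"], ["football match", "bank fraud"]]

good_intra, good_inter = cluster_quality(good)
bad_intra, bad_inter = cluster_quality(bad)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

The first clustering keeps similar documents together, so its intra-cluster average exceeds its inter-cluster average; the second does the opposite.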

Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
  - Classification
  - Clustering

Classification: An Example
[Figure: a Training Set with categorical and continuous attributes and a class column → Learn Classifier → Model, which is applied to the Test Set]

Text Classification: An Example
[Figure: a Training Set with a text attribute and a class column → Learn Classifier → Model, which is applied to the Test Set]

Classification Techniques
- Instance-based methods
- Decision trees
- Neural networks
- Bayesian classification

Instance-based Methods
- Instance-based (memory-based) learning
  - Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- k-nearest neighbor approach
  - Instances (examples) are represented as points in a Euclidean space

Text Examples in Euclidean Space
- “The English football fan is a hooligan...”
- “Similar to his English equivalent, the Italian football fan is a hooligan...”
[Figure: the two sentences plotted as points on axes labeled “football” and “Italian”]

K-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance
- The k-NN returns the most common value among the k nearest training examples
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: training points labeled + and _ surrounding a query point ?]
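The k-NN rule described above is short enough to sketch in full. The training points are hypothetical; following the earlier text example, the two coordinates can be read as counts of the words "football" and "Italian". `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical labeled instances as points in a 2-D Euclidean space.
training = [
    ((2, 0), "+"), ((3, 1), "+"), ((2, 1), "+"),
    ((0, 2), "-"), ((1, 3), "-"), ((0, 3), "-"),
]
query = (2, 2)
label_3nn = knn_classify(training, query, k=3)
label_1nn = knn_classify(training, query, k=1)
print(label_3nn, label_1nn)
```

Nothing is learned ahead of time: all the work happens when a query arrives, which is the "lazy evaluation" mentioned for instance-based methods.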

Classification Techniques
- Instance-based methods
- Decision trees
- Neural networks
- Bayesian classification

Decision Tree: An Example
[Figure: training data with categorical and continuous attributes and a class column, next to a decision tree whose splitting attributes are English (Yes / No), MarSt (Married / Single, Divorced), and Income (> 80K / < 80K), with YES / NO leaves]
- The splitting attribute at a node is determined based on a specific attribute selection algorithm

Decision Tree: A Text Example
[Figure: training data with a text attribute and a class column, next to a decision tree whose splitting attributes are English (Yes / No), MarSt (Married / Single, Divorced), and Income (> 80K / < 80K), with YES / NO leaves]
- The splitting attribute at a node is determined based on a specific attribute selection algorithm

Classification by DT Induction
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases:
  - Tree construction
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attributes of the sample against the decision tree
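Using a decision tree to classify an unknown sample amounts to walking from the root to a leaf. In the sketch below the attribute names MarSt and Income are borrowed from the example slides, but the tree shape, the leaf labels, and the handling of the 80K threshold are assumptions; tree construction and pruning are not shown.

```python
def classify(node, record):
    """Walk from the root to a leaf, following the branch that matches each test outcome."""
    while not isinstance(node, str):       # leaves are plain class labels
        test, branches = node              # internal node: (test function, branches)
        node = branches[test(record)]
    return node

# Hypothetical hand-built tree.
income_node = (
    lambda r: "> 80K" if r["Income"] > 80_000 else "<= 80K",
    {"> 80K": "YES", "<= 80K": "NO"},
)
tree = (
    lambda r: r["MarSt"],
    {"Married": "NO", "Single": income_node, "Divorced": income_node},
)

samples = [
    {"MarSt": "Married", "Income": 120_000},
    {"MarSt": "Single", "Income": 95_000},
    {"MarSt": "Divorced", "Income": 60_000},
]
predictions = [classify(tree, s) for s in samples]
print(predictions)
```

Each internal node tests one attribute, each dictionary key is an outcome of that test, and each string is a leaf holding a class label.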

Clustering Techniques
- Partitioning methods
- Hierarchical methods

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of k clusters
- Given: a set of documents and the number k
- Find: a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means: each cluster is represented by the center of the cluster

The K-means Clustering Method
- The k-means algorithm is implemented in 4 steps:
  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no more new assignments.
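The four steps map almost line for line onto code. This is a sketch under stated assumptions: the initial partition is a deterministic round-robin split (the slide does not say how to form it), the input has at least k points, an emptied cluster keeps its previous centroid, and the sample points are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Mean point of a nonempty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (round-robin, an assumption).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes; otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

# Hypothetical 2-D objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

Because k-means is a heuristic, a different initial partition can converge to a different result on less clearly separated data.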

The K-means Clustering: Example

Clustering Techniques
- Partitioning methods
- Hierarchical methods

Hierarchical Clustering
- Agglomerative: start with each document being a single cluster. Eventually all documents belong to the same cluster.
- Divisive: start with all documents belonging to the same cluster. Eventually each node forms a cluster on its own.
- Does not require the number of clusters k in advance
- Needs a termination condition
  - The final mode in both agglomerative and divisive clustering is of no use.

Hierarchical Clustering: Example
[Figure: agglomerative clustering runs from Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} → {a, b} → {d, e} → {c, d, e} → {a, b, c, d, e}; divisive clustering traverses the same hierarchy in the opposite direction]

A Dendrogram: Hierarchical Clustering
- Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters).
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
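Agglomerative clustering with a termination condition can be sketched as below. Stopping once the closest pair of clusters is farther apart than a threshold plays the role of cutting the dendrogram at that level. Single-link distance, the threshold values, and the coordinates of the five objects (indices 0 to 4 standing for a to e) are assumptions for illustration. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

def single_link(c1, c2, points):
    """Distance between two clusters: the distance of their closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, max_distance):
    """Merge the two closest clusters until the closest pair is farther than max_distance."""
    clusters = [[i] for i in range(len(points))]   # start: one cluster per object
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]], points),
        )
        if single_link(clusters[a], clusters[b], points) > max_distance:
            break                                  # termination condition
        clusters[a] = clusters[a] + clusters[b]    # merge b into a (a < b always holds)
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical positions for objects a, b, c, d, e (indices 0 to 4).
points = [(0.0, 0.0), (1.0, 0.0), (6.0, 0.0), (8.5, 0.0), (10.0, 0.0)]
cut_low = agglomerative(points, max_distance=2.0)
cut_high = agglomerative(points, max_distance=3.0)
print(cut_low)
print(cut_high)
```

A lower cut yields more, smaller clusters; with a large enough threshold everything ends up in a single cluster, which is the final state the earlier slide calls of no use.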

Demo

Commercial Tools
- IBM Intelligent Miner for Text
- Semio Map
- InXight LinguistX / ThingFinder
- LexiQuest
- ClearForest
- Teragram
- SRA NetOwl Extractor
- Autonomy

Summary
- Text is tricky to process, but “ok” results are easily achieved
- There exist several text mining systems
  - e.g., D2K - Data to Knowledge
- Additional intelligence can be integrated with text mining
  - One may play with any phase of the text mining process

- There are many other scientific and statistical text mining methods that have been developed but are not covered in this talk.
- Also, it is important to study the theoretical foundations of data mining.
  - Data Mining: Concepts and Techniques / J. Han & M. Kamber
  - Machine Learning / T. Mitchell