1
Text Mining IS698 Min Song
2
Example 1: KM People Finder
The Needs:
- Find people as well as documents that can address my information need.
- Promote collaboration and knowledge sharing.
- Leverage existing information access systems.
The Information Sources:
- Email, groupware, online reports, …
3
Example 1: Simple KM People Finder
[Diagram components: Query; Search or Navigation System; Relevant Docs; Name Extractor; Authority List; Ranked People Names]
4
Example 1: KM People Finder
5
Text Mining Definition
Many definitions in the literature:
- “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”
- An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
6
Text Mining Definition
What is “previously unknown” information?
- Strict definition: information that not even the writer knows. E.g., discovering a new method for hair growth that is described as a side effect of a different procedure.
- Lenient definition: rediscovering the information that the author encoded in the text. E.g., automatically extracting a product's name from a web page.
7
Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
8
Text Mining Applications
- Marketing: discover distinct groups of potential buyers according to a user's text-based profile (e.g., Amazon).
- Industry: identify groups of competitors' web pages (e.g., competing products and their prices).
- Job seeking: identify parameters in searching for jobs (e.g., www.flipdog.com).
9
Text Mining Methods
- Information Retrieval: indexing and retrieval of textual documents.
- Information Extraction: extraction of partial knowledge in the text.
- Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web.
- Clustering: generating collections of similar text documents.
10
Information Retrieval
Given:
- A source of textual documents
- A user query (text based), e.g., Spam / Text
Find:
- A ranked set of documents that are relevant to the query
[Diagram: Query and Documents source feed the IR System, which outputs Ranked Documents]
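The Given/Find setup above can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the scoring is a simple TF-IDF style weighting chosen here as an assumption, and the names `tokenize` and `rank_documents` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank_documents(query, documents):
    """Return (doc_index, score) pairs, best match first.

    Score = sum over query terms of term frequency in the document
    times a smoothed inverse document frequency (assumed weighting).
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(
            tf[t] * math.log(1 + n_docs / df[t])
            for t in tokenize(query)
            if t in tf
        )
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "cheap flights to paris",
    "paris hotel deals",
    "python text mining tutorial",
]
ranking = rank_documents("text mining", docs)
```

Here `ranking` lists the third document first, because it is the only one that contains the query words.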
11
Intelligent Information Retrieval
- Meaning of words
  - Synonyms: “buy” / “purchase”
  - Ambiguity: “bat” (baseball vs. mammal)
- Order of words in the query
  - “hot dog stand in the amusement park” vs. “hot amusement stand in the dog park”
- User dependency for the data
  - Direct feedback
  - Indirect feedback
- Authority of the source
  - IBM is more likely to be an authoritative source than my distant second cousin
12
What is Information Extraction?
Given:
- A source of textual documents
- A well-defined, limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output it in a predetermined format
13
Information Extraction: Example
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Extracted template:
- Incident Date: 19 Apr 89
- Incident Type: Bombing
- Perpetrator Individual ID: “urban guerillas”
- Human Target Name: “Roberto Garcia Alvarado”
- ...
14
What is Information Extraction?
[Diagram: Query 1 (e.g., job title) and Query 2 (e.g., salary) go to the Extraction System, which works over the Documents source and the Ranked Documents, extracts Relevant Info 1, 2, 3, and combines the query results]
15
Why Mine the Web?
- Enormous wealth of textual information on the Web
  - Book/CD/video stores (e.g., Amazon)
  - Restaurant information (e.g., Zagats)
  - Car prices (e.g., Carpoint)
- Lots of data on user access patterns
  - Web logs contain sequences of URLs accessed by users
- Possible to retrieve “previously unknown” information
  - People who ski also frequently break their legs.
  - Restaurants that serve seafood in California are likely to be outside San Francisco.
16
Mining the Web
[Diagram: a Web Spider collects the Documents source; the IR / IE System takes a Query and returns Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3)]
17
Unique Features of the Web
- The Web is a huge collection of documents, many of which contain:
  - Hyper-link information
  - Access and usage information
- The Web is very dynamic
  - Web pages are constantly being generated (and removed)
Challenge: develop new Web mining algorithms that
- Exploit hyper-links and access patterns
- Adapt to a changing documents source
18
Intelligent Web Search
Combine the intelligent IR tools:
- Meaning of words
- Order of words in the query
- User dependency for the data
- Authority of the source
With the unique web features:
- Retrieve hyper-link information
- Utilize hyper-links as input
19
What is Clustering?
Given:
- A source of textual documents
- A similarity measure, e.g., how many words are common in these documents
Find:
- Several clusters of documents that are relevant to each other
[Diagram: Documents source and Similarity measure feed the Clustering System, which outputs groups of documents]
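The similarity measure mentioned above ("how many words are common in these documents") can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and the Jaccard variant is an added assumption that normalizes the raw count so that long documents do not dominate.

```python
def word_set(text):
    """Set of distinct lowercase words in a document (naive whitespace split)."""
    return set(text.lower().split())


def common_words(doc_a, doc_b):
    """Raw count of distinct words shared by both documents."""
    return len(word_set(doc_a) & word_set(doc_b))


def jaccard_similarity(doc_a, doc_b):
    """Shared words divided by all distinct words in either document."""
    a, b = word_set(doc_a), word_set(doc_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


shared = common_words("the cat sat", "the cat ran")
ratio = jaccard_similarity("the cat sat", "the cat ran")
```

With these two toy documents, `shared` counts "the" and "cat", and `ratio` divides that by the four distinct words overall.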
20
Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
21
Text characteristics: Outline
- Large textual data base
- High dimensionality
- Several input modes
- Dependency
- Ambiguity
- Noisy data
- Not well structured text
22
Text characteristics
- Large textual data base
  - Efficiency consideration: over 2,000,000,000 web pages; almost all publications are also in electronic form
- High dimensionality (sparse input)
  - Consider each word/phrase as a dimension
- Several input modes
  - E.g., Web mining: information about the user is generated by semantics, browse patterns, and an outside knowledge base
23
Text characteristics
- Dependency
  - Relevant information is a complex conjunction of words/phrases, e.g., document categorization, pronoun disambiguation
- Ambiguity
  - Word ambiguity: pronouns (he, she …); “buy”, “purchase”
  - Semantic ambiguity: “The king saw the rabbit with his glasses.”
24
Text characteristics
- Noisy data
  - Example: spelling mistakes
- Not well structured text
  - Chat rooms: “r u available ?”, “Hey whazzzzzz up”
  - Speech
25
Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
26
Text mining process
27
- Text preprocessing: syntactic/semantic text analysis
- Features generation: bag of words
- Features selection: simple counting, statistics
- Text/data mining: classification (supervised learning), clustering (unsupervised learning)
- Analyzing results
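The feature generation and feature selection steps above can be illustrated with a small sketch. It assumes whitespace tokenization and a "keep words that occur in at least `min_doc_count` documents" rule as the simple-counting criterion; both choices, and all function names, are assumptions made for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Feature generation: map a text to word -> count, ignoring word order."""
    return Counter(text.lower().split())


def select_features(documents, min_doc_count=2):
    """Feature selection by simple counting: keep words that appear
    in at least min_doc_count documents (hypothetical threshold)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    return sorted(w for w, c in doc_freq.items() if c >= min_doc_count)


def vectorize(text, vocabulary):
    """Turn a text into a fixed-length count vector over the vocabulary."""
    counts = bag_of_words(text)
    return [counts[w] for w in vocabulary]


docs = [
    "football fan cheers",
    "football match tonight",
    "italian football fan",
]
vocabulary = select_features(docs)
vector = vectorize("football football fan", vocabulary)
```

The resulting vectors are what a classifier or clustering method would consume in the text/data mining step.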
28
Syntactic / Semantic text analysis
- Part-of-speech (POS) tagging
  - Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
  - ~98% accurate
- Word sense disambiguation
  - Context based or proximity based
  - Very accurate
- Parsing
  - Generates a parse tree (graph) for each sentence
  - Each sentence is a stand-alone graph
29
Text Mining: Classification definition
- Given: a collection of labeled records (training set). Each record contains a set of features (attributes) and the true class (label).
- Find: a model for the class as a function of the values of the features.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
30
Text Mining: Clustering definition
- Given: a set of documents and a similarity measure among documents.
- Find: clusters such that:
  - Documents in one cluster are more similar to one another
  - Documents in separate clusters are less similar to one another
- Goal: finding a correct set of documents.
Similarity measures:
- Euclidean distance, if attributes are continuous
- Other problem-specific measures, e.g., how many words are common in these documents
31
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
32
Evaluation: What Is Good Classification?
- Correct classification: the known label of a test sample is identical to the class result from the classification model.
- Accuracy ratio: the percentage of test set samples that are correctly classified by the model.
- A distance measure between classes can be used, e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”.
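Both evaluation ideas above fit in a short sketch: the accuracy ratio, and an error cost that uses a distance between classes. The distance values in `CLASS_DISTANCE` are invented for illustration (the slide gives no numbers), and the function names are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical distances between classes: confusing two sports is a
# smaller mistake than confusing sports with crime.
CLASS_DISTANCE = {
    ("football", "basketball"): 0.2,
    ("football", "crime"): 1.0,
    ("basketball", "crime"): 1.0,
}


def mean_error_cost(true_labels, predicted_labels, distance=CLASS_DISTANCE):
    """Average class distance of the errors (0.0 for correct predictions)."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            # Look the pair up in either order; unknown pairs cost 1.0.
            total += distance.get((t, p), distance.get((p, t), 1.0))
    return total / len(true_labels)


truth = ["football", "crime", "football", "basketball"]
predicted = ["football", "crime", "basketball", "basketball"]
acc = accuracy(truth, predicted)
cost = mean_error_cost(truth, predicted)
```

In this toy run three of four samples are correct, and the single error is a cheap football/basketball confusion.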
33
Evaluation: What Is Good Clustering?
- A good clustering method produces high-quality clusters with:
  - High intra-class similarity
  - Low inter-class similarity
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
34
Outline
- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
  - Classification
  - Clustering
35
Classification: An Example
[Figure: a Training Set table with categorical and continuous attributes and a class column is used to learn a classifier (Model), which is then applied to a Test Set]
36
Text Classification: An Example
[Figure: a Training Set with text and class columns is used to learn a classifier (Model), which is then applied to a Test Set]
37
Classification Techniques
- Instance-based methods
- Decision trees
- Neural networks
- Bayesian classification
38
Instance-based Methods
- Instance-based (memory-based) learning: store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified.
- k-nearest neighbor approach: instances (examples) are represented as points in a Euclidean space.
39
Text Examples in Euclidean Space
- “The English football fan is a hooligan..”
- “Similar to his English equivalent, the Italian football fan is a hooligan..”
[Figure labels: football, Italian]
40
K-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- k-NN returns the most common value among the k nearest training examples.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: scatter of + and _ training points around a query point ?]
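A minimal sketch of the k-NN rule described above, assuming instances are already numeric feature vectors (for text, for example, word counts). The training points and the names `euclidean` and `knn_classify` are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training, k=3):
    """Return the most common label among the k training examples
    closest to the query point. `training` is a list of (vector, label)."""
    neighbors = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training = [
    ((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
    ((8, 8), "-"), ((8, 9), "-"), ((9, 8), "-"),
]
label_near_plus = knn_classify((2, 2), training, k=3)
label_near_minus = knn_classify((7, 9), training, k=1)
```

Nothing is learned ahead of time: all the work happens when a query arrives, which is the "lazy evaluation" property of instance-based methods.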
41
Classification Techniques
- Instance-based methods
- Decision trees
- Neural networks
- Bayesian classification
42
Decision Tree: An Example
[Figure: a training table with categorical and continuous attributes and a class column, next to a decision tree. Splitting attributes: English (Yes / No), MarSt (Married / Single, Divorced), Income (< 80K / > 80K). Leaves are YES / NO class labels.]
The splitting attribute at a node is determined based on a specific attribute selection algorithm.
43
Decision Tree: A Text Example
[Figure: a training table with text and class columns, next to a decision tree. Splitting attributes: English (Yes / No), MarSt (Married / Single, Divorced), Income (< 80K / > 80K). Leaves are YES / NO class labels.]
The splitting attribute at a node is determined based on a specific attribute selection algorithm.
44
Classification by DT Induction
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases:
  - Tree construction
  - Tree pruning: identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attributes of the sample against the decision tree
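Classifying an unknown sample amounts to walking from the root to a leaf. The sketch below shows only that walk, not tree construction or pruning. The tree reuses the attribute names from the example slides (English, MarSt, Income), but its exact shape and the 80,000 threshold handling are assumptions, since the original figure is not recoverable from the transcript.

```python
# Hypothetical tree: internal nodes have a "test" function that returns
# the name of the branch to follow; leaves carry a class "label".
TREE = {
    "test": lambda s: s["English"],
    "branches": {
        "Yes": {"label": "NO"},
        "No": {
            "test": lambda s: (
                "Married" if s["MarSt"] == "Married" else "Single, Divorced"
            ),
            "branches": {
                "Married": {"label": "NO"},
                "Single, Divorced": {
                    "test": lambda s: (
                        "> 80K" if s["Income"] > 80_000 else "<= 80K"
                    ),
                    "branches": {
                        "<= 80K": {"label": "NO"},
                        "> 80K": {"label": "YES"},
                    },
                },
            },
        },
    },
}


def classify(sample, tree):
    """Follow the outcome of each attribute test until a leaf is reached."""
    node = tree
    while "label" not in node:
        node = node["branches"][node["test"](sample)]
    return node["label"]


result_a = classify({"English": "No", "MarSt": "Single", "Income": 95_000}, TREE)
result_b = classify({"English": "No", "MarSt": "Married", "Income": 95_000}, TREE)
result_c = classify({"English": "Yes", "MarSt": "Single", "Income": 20_000}, TREE)
```

Each sample only evaluates the tests along its own path, so attributes below a leaf it reaches early are never consulted.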
45
Clustering Techniques
- Partitioning methods
- Hierarchical methods
46
Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of k clusters.
- Given: a set of documents and the number k.
- Find: a partition of k clusters that optimizes the chosen partitioning criterion.
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
- k-means: each cluster is represented by the center of the cluster.
47
The K-means Clustering Method
The k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
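The four steps above can be sketched as follows. This is a minimal illustration, not a production implementation: the initial partition is a simple round-robin split (an assumption; the slide does not say how to pick it), an empty cluster keeps its previous centroid, and `math.dist` requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def k_means(points, k, max_iter=100):
    """Return (assignment, centroids); assumes len(points) >= k."""
    # Step 1: partition objects into k nonempty subsets (round-robin).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids = k_means(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, even though the round-robin start mixes them.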
48
The K-means Clustering: Example
49
Clustering Techniques
- Partitioning methods
- Hierarchical methods
50
Hierarchical Clustering
- Agglomerative: start with each document being a single cluster. Eventually all documents belong to the same cluster.
- Divisive: start with all documents belonging to the same cluster. Eventually each document forms a cluster on its own.
- Does not require the number of clusters k in advance.
- Needs a termination condition: the final state in both agglomerative and divisive clustering is of no use.
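A sketch of the agglomerative direction with an explicit termination condition. Two choices here are assumptions not stated on the slide: clusters are compared by single-link distance (closest pair of members), and merging stops once the closest pair of clusters is farther apart than `max_distance`.

```python
import math


def single_link(cluster_a, cluster_b):
    """Distance between two clusters = distance of their closest members."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)


def agglomerative(points, max_distance):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until they are farther apart than max_distance."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # termination condition reached
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0,), (0.5,), (1.0,), (10.0,), (10.4,)]
two_groups = agglomerative(points, max_distance=2.0)
one_group = agglomerative(points, max_distance=100.0)
```

With a very large `max_distance` everything collapses into one cluster, which is the uninformative final state the slide warns about.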
51
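Hierarchical Clustering: Example
[Figure: objects a, b, c, d, e. Agglomerative (Step 0 to Step 4): a and b merge into ab; d and e merge into de; c joins de to form cde; ab and cde merge into abcde. Divisive runs the same steps in reverse (Step 4 to Step 0).]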
52
A Dendrogram: Hierarchical Clustering
- Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters).
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
53
Demo
54
Commercial Tools
- IBM Intelligent Miner for Text
- Semio Map
- InXight LinguistX / ThingFinder
- LexiQuest
- ClearForest
- Teragram
- SRA NetOwl Extractor
- Autonomy
55
Summary
- Text is tricky to process, but “ok” results are easily achieved.
- There exist several text mining systems, e.g., D2K - Data to Knowledge: http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
- Additional intelligence can be integrated with text mining.
  - One may play with any phase of the text mining process.
56
There are many other scientific and statistical text mining methods developed but not covered in this talk.
- http://www.cs.utexas.edu/users/pebronia/text-mining/
- http://filebox.vt.edu/users/wfan/text_mining.html
Also, it is important to study the theoretical foundations of data mining.
- Data Mining: Concepts and Techniques / J. Han & M. Kamber
- Machine Learning / T. Mitchell